MoodleDocs - User contributions [pl]

Supervised block

2014-04-02T00:46:19Z

Oasychev: /* Editing quiz settings */

The main idea of the '''supervised block''' is to give you additional control over your students, so that they can perform certain actions only under a teacher's supervision. Installed together with '''supervisedcheck''' (a quiz access rule plugin, included out of the box), it allows you to add restrictions to your quizzes.

How do they work together? The course teacher creates a session, specifying the academic group, lesson type (e.g. laboratory work, exam), classroom and duration. After that, students will be able to start quizzes from this course if the following conditions are met:
* the session is active;
* the student is in the academic group for which the session was created;
* the student is in the session's classroom (you can specify the IP subnet for each classroom, see below);
* the session was created for a lesson type that is specified for the current quiz (see the "Editing quiz settings" section).

Of course, you can place only some important quizzes under the teacher's control rather than all quizzes in the course.
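
The four conditions above can be sketched as a single check. This is only an illustration of the rule, not the plugin's actual code: the function name, the array keys and the use of group id 0 for "All groups" are assumptions made for the example.

<code php>
<?php
// Hypothetical data shapes, chosen for this example only.
// $session: start (Unix time), duration (seconds), groupid, lessontypeid
// $student: groupids (groups the student belongs to), inclassroom (bool,
//           i.e. the student's IP address matched the classroom subnet)
// $quiz:    lessontypeids (lesson types checked in the quiz settings)
function can_start_quiz(array $session, array $student, array $quiz, int $now): bool {
    // 1. The session is active.
    $active = $session['start'] <= $now
        && $now < $session['start'] + $session['duration'];
    // 2. The student is in the session's group (0 stands for "All groups" here).
    $ingroup = $session['groupid'] === 0
        || in_array($session['groupid'], $student['groupids'], true);
    // 3. The student is in the session's classroom.
    $inclassroom = $student['inclassroom'];
    // 4. The session's lesson type is one of those allowed for the quiz.
    $typeok = in_array($session['lessontypeid'], $quiz['lessontypeids'], true);

    return $active && $ingroup && $inclassroom && $typeok;
}

$session = ['start' => 1000, 'duration' => 5400, 'groupid' => 7, 'lessontypeid' => 2];
$student = ['groupids' => [7], 'inclassroom' => true];
$quiz = ['lessontypeids' => [2, 3]];

var_dump(can_start_quiz($session, $student, $quiz, 1500)); // bool(true)
var_dump(can_start_quiz($session, $student, $quiz, 7000)); // bool(false), session is over
</code>

All four checks must pass, so failing any single condition is enough to block the quiz attempt.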

You can also combine the supervised block with the [https://docs.moodle.org/2x/pl/Auto_role_assignment_block Auto role assignment block]. This is useful for temporarily changing a user's role during the session.

== Structure and installation ==
Out of the box you get two Moodle plugins:
# supervised is a block for managing sessions, classrooms and lesson types. To install it, put the supervised folder into the your-moodle-path/blocks/ directory.
# supervisedcheck is a quiz access rule plugin that sets up your quizzes to work with the supervised block. To install it, put the supervisedcheck folder into the your-moodle-path/mod/quiz/accessrule/ directory. This plugin cannot be used on its own and must be installed together with or after the supervised block.
Go to http://your-moodle-website/admin/ to finish the installation process.

== Teacher-side ==
First of all, go to your course, turn editing on and add the supervised block so that you can work with it. Before creating your first session, you should create lesson types and classrooms (and perhaps add some groups to your course).

=== Creating lesson types ===
You can skip this step if you want all your quizzes either to be accessible in all sessions (lessons) or to have no restrictions at all.

However, in more complex courses you may want some quizzes to be accessible in some lessons and not in others. In that case you need to define lesson types. A lesson type is a kind of lesson to which a quiz can be limited; it is called a "lesson type" because a teacher often has several lessons on one theme (type) with different student groups. If you define lesson types for your course, the quiz editing form will let you check the lesson types in which the quiz will be available.

In the supervised block's body, click the "lesson types" link. Here you can add some:

[[Image:lessontypes.png]]

A lesson type is described by its name and nothing else. Note that lesson types are created for the current course, so you must create new ones for every other course.

=== Creating classrooms ===
To start a session you must provide at least one classroom, i.e. a range of computers under your supervision. A range of computers can be defined by an IP subnet. If you don't know what this means, ask your admin for the "IP range" of your class; it will come as a string that you can enter here.

Each classroom is described by a name and an IP subnet. Additionally, you can "hide" a classroom, so that it remains in the system but you won't be able to create a session in it.
You can specify the IP subnet in the standard Moodle format: a comma-separated list in which each address can be written in one of these ways:
* xxx.xxx.xxx.xxx (full IP address)
* xxx.xxx.xxx.xxx/nn (number of bits in net mask)
* xxx.xxx.xxx.xxx-yyy (a range of IP addresses in the last group)
* xxx.xxx or xxx.xxx. (incomplete address)
Example: 235.144.18, 235.144.19.2-34, 235.144.19.38, 235.144.19.40-44
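
To make the format concrete, here is a minimal standalone sketch of how an address could be tested against such a list. It is not the code the block or Moodle uses (Moodle core has its own helper for this format, address_in_subnet(), as far as we know); the function name ip_in_classroom is made up for the example, and the sketch assumes IPv4 addresses and 64-bit PHP.

<code php>
<?php
// Illustrative IPv4-only matcher for the comma-separated list format above.
// Assumes 64-bit PHP, where ip2long() returns non-negative integers.
function ip_in_classroom(string $ip, string $subnets): bool {
    $addr = ip2long($ip);
    if ($addr === false) {
        return false;
    }
    foreach (explode(',', $subnets) as $entry) {
        $entry = trim($entry);
        if ($entry === '') {
            continue;
        }
        if (strpos($entry, '/') !== false) {
            // xxx.xxx.xxx.xxx/nn: compare the first nn bits.
            [$net, $bits] = explode('/', $entry, 2);
            $netlong = ip2long($net);
            $bits = (int)$bits;
            if ($netlong === false || $bits < 0 || $bits > 32) {
                continue;
            }
            $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
            if (($addr & $mask) === ($netlong & $mask)) {
                return true;
            }
        } else if (strpos($entry, '-') !== false) {
            // xxx.xxx.xxx.xxx-yyy: a range within the last group.
            [$first, $last] = explode('-', $entry, 2);
            $low = ip2long($first);
            if ($low === false) {
                continue;
            }
            $high = ($low & 0xFFFFFF00) | ((int)$last & 0xFF);
            if ($addr >= $low && $addr <= $high) {
                return true;
            }
        } else if (substr_count(rtrim($entry, '.'), '.') === 3) {
            // Full address: exact match.
            if ($addr === ip2long($entry)) {
                return true;
            }
        } else {
            // Incomplete address: match on whole leading groups.
            $prefix = rtrim($entry, '.') . '.';
            if (strpos($ip, $prefix) === 0) {
                return true;
            }
        }
    }
    return false;
}

$list = '235.144.18, 235.144.19.2-34, 235.144.19.38, 235.144.19.40-44';
var_dump(ip_in_classroom('235.144.18.200', $list)); // bool(true), incomplete address
var_dump(ip_in_classroom('235.144.19.41', $list));  // bool(true), range 40-44
var_dump(ip_in_classroom('235.144.19.35', $list));  // bool(false)
</code>

Note that the incomplete form matches whole groups only: 235.144.18 covers 235.144.18.7 but not 235.144.180.7.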

[[Image:classrooms.png]]

Unlike lesson types, classrooms can be shared across all courses.

=== Editing quiz settings ===
To link quizzes from the course to supervised sessions, go to their settings and open the "Extra restrictions on attempts" section. Here you will find three options for your quiz:
* Don't use supervised control for this quiz (the default) - the quiz is accessible without supervision.
* The quiz is accessible in any supervised session, but only in supervised sessions.
* The quiz is accessible only in sessions of certain lesson types (if you start a session for an unchecked lesson type, students won't be able to start this quiz).

[[Image:qar.png]]

=== Sessions starting or planning ===
The teacher can start a new session right from the supervised block's body by choosing the classroom, lesson type (if any have been created), academic group and session duration:

[[Image:start_s.png]]

The teacher can also choose "All groups". In this case all students of the course will be able to start the quiz.

During the session the teacher can change the classroom, academic group or duration, view the session logs (see below) or finish the session:

[[Image:active_s.png]]

The second option is to '''plan the session'''. In this case the manager plans the schedule for teachers and adds sessions for them. Teachers can also plan sessions for themselves (see the Capabilities section).

[[Image:plane_s.png]]

If the "Send e-mail" option is checked, the teacher will be notified about the session's creation and any changes to it.

What happens if the teacher has a session planned close to the current time? The supervised block shows a notification about it and offers to start it. The teacher can still change the classroom, group, lesson type or duration before starting.

The teacher can view session logs and filter them by the users who took part in the session.

== Student-side ==
When a student is in the course, the supervised block shows whether there are any active sessions:

[[Image:student_side.png]]

If there is one, the student can try to start the quiz.

== Capabilities ==
{| border="1" cellspacing="0"
|'''Capability name'''
|'''Archetypes'''
|'''Description'''
|-
|block/supervised: besupervised
|student
|User can be supervised
|-
|block/supervised: supervise
|editingteacher, teacher
|Ability to:
* start planned session
* start new session
* change / finish active session
* view active session logs
User can't plan his own sessions.
|-
|block/supervised: editclassrooms
|editingteacher, manager
|Add / edit / delete / view classrooms
|-
|block/supervised: editlessontypes
|editingteacher, manager
|Add / edit / delete / view lesson types
|-
|block/supervised: viewownsessions
|editingteacher, teacher
|View own sessions (active, planned, finished) and their logs.
User can't edit them, or start or plan new ones.
|-
|block/supervised: viewallsessions
|manager, editingteacher
|View sessions of all teachers (active, planned, finished) and their logs.
User can't edit them, or start or plan new ones.
|-
|block/supervised: manageownsessions
|editingteacher, teacher
|Ability for own sessions:
* plan
* view / edit / remove planned
User can't start his own sessions.
|-
|block/supervised: manageallsessions
|manager, coursecreator
|Ability for sessions of all teachers:
* plan
* view / edit / remove planned
User can't start sessions.
|-
|block/supervised: managefinishedsessions
|manager, coursecreator
| Ability to remove finished sessions.
|}

Supervised block

2014-04-02T00:44:28Z

Oasychev: /* Creating classrooms */

Supervised block

2014-04-02T00:41:32Z

Oasychev: /* Creating lesson types */

Formal Languages Block

2013-12-28T19:05:51Z

Oasychev: /* User interface */

== Authors ==

# Idea, string analysis method, general architecture and architecture implementation - Oleg Sychev
# Implementation of built-in language - Dmitry Mamontov

== Description ==

The goal of the formal languages block is to provide an API for managing formal languages, a well-known mathematical formalism that defines a set of strings constrained by rules.

'''Current status'''

{| class="wikitable"
|-
! Feature
! Description
! Status
|-
| Scanning
| Breaks a string into tokens that can be useful in further analysis
| Implemented
|-
| Parsing
| Constructs an [http://en.wikipedia.org/wiki/Abstract_syntax_tree abstract syntax tree] from a string, allowing deep analysis of the string or even evaluation
| Not implemented
|-
| Managing formal languages
| Scanners for C, C++ and printf format strings are currently implemented, along with a simple lexer for the English language. User-defined lexers and parsers are planned for future releases.
| Implemented partially
|}

'''Current language implementation status'''

{| class="wikitable"
|-
! Language
! Scanning
! Parsing
|-
| Simple English
| Implemented
| Not implemented
|-
| C
| Implemented
| Not implemented
|-
| C++
| Implemented
| Implemented partially
|-
| printf formatting string
| Implemented
| Not implemented
|}

== Installation ==

To work, the Formal languages block needs some additional components. All of them must be installed for it to work.

You need to install the POASquestion question type ([https://moodle.org/plugins/view.php?plugin=qtype_poasquestion qtype_poasquestion]), which is abstract (i.e. it does not appear as a real question type) but contains useful code for scanning and working with Unicode strings.

== User interface ==

[[{{ns:file}}:block_formal_langs.PNG]]

After adding the formal languages block to a course, a teacher can manage language visibility by clicking the eye icons. Dimmed languages, whose visibility is disabled, will not be shown in CorrectWriting or other plugins that use the API of the formal languages block.

Note that language visibility can be inherited from the site settings. When the site setting applies to the current course, the user sees the label "(Site)" before the language name, while the "(Course)" label shows that the setting is applied at the course level. If the visibility of a language is the same at the site and course levels, the language in this course is assumed to follow the site-level visibility and will change when the site-level visibility changes. If the course visibility differs from the site visibility, it is treated as independent and will not change with site-level changes.
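
The inheritance rule can be sketched as follows. This is an illustration only, not the block's code: the function name and the two arrays (language id => visible) are hypothetical, and the sketch assumes the site array has an entry for every language.

<code php>
<?php
// Hypothetical inputs: $site and $course map language ids to visibility flags.
// $course holds only the languages that have a course-level setting.
function effective_visibility(int $langid, array $site, array $course): array {
    $sitevisible = $site[$langid];
    // A course-level setting counts only when it differs from the site one.
    if (array_key_exists($langid, $course) && $course[$langid] !== $sitevisible) {
        return ['visible' => $course[$langid], 'label' => '(Course)'];
    }
    // Otherwise the language follows the site-level visibility.
    return ['visible' => $sitevisible, 'label' => '(Site)'];
}

$site = [1 => true, 2 => true];
$course = [2 => false];

print_r(effective_visibility(1, $site, $course)); // visible: true, label: (Site)
print_r(effective_visibility(2, $site, $course)); // visible: false, label: (Course)
</code>

With this rule, a course setting equal to the site setting is indistinguishable from having no course setting at all, which is why such a language keeps following later site-level changes.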

The administrator can also edit the global visibility of formal languages using the global settings, located in the administrator menu (see the pictures for details).

[[{{ns:file}}:block_formal_langs_global_settings_link.PNG]]

The central part of the admin page looks like this:

[[{{ns:file}}:block_formal_langs_global_settings_page.PNG]]

When changing site-level language visibility, the admin is shown the list of courses affected by the change.

== API for developers ==

The main block class, '''block_formal_langs''', provides two simple functions that are useful if you want to use our API to perform string scanning (see the CorrectWriting question type for examples of how it works).

#'''block_formal_langs::available_langs''' - returns an array of languages that can be used in the current context. Receives the current context ID.
#'''block_formal_langs::lang_object''' - returns the language object for a given language ID.

After getting a language object, you can use '''create_from_string''' to scan a string without referring to descriptions in the database, or '''create_from_db''' to refer to them. Either call returns a special object that can be used for working with lexemes.

If $a is the object returned by '''create_from_string''' or '''create_from_db''', you can use $a->string to get the scanned string, $a->stream->tokens to get the array of scanned lexemes and $a->stream->errors to get the array of errors.

For each scanned lexeme, you can use the '''type()''' method to obtain the lexeme's type, or '''value()''' to obtain its semantic value.

'''Examples:'''

'''1. You can use the following code to print the names and IDs of all languages available in the system context.'''

<code php>
// Assumes Moodle is bootstrapped and the block_formal_langs class is loaded.
$langs = block_formal_langs::available_langs(context_system::instance()->id);

// The returned array maps language ids to language names.
foreach ($langs as $id => $name) {
    echo 'id: ' . $id . ' name: ' . $name . PHP_EOL;
}
</code>

'''2. You can use the following code to print the text of all tokens obtained by scanning a string with the C language lexer.'''

<code php>
// The id of the C programming language will be 2 in most databases.
$lang = block_formal_langs::lang_object(2);

// Scan the string without referring to descriptions stored in the database.
$string = $lang->create_from_string('int a;');

// Print the semantic value of every scanned token.
foreach ($string->stream->tokens as $token) {
    echo $token->value() . PHP_EOL;
}
</code>

[[Category: Block]][[Category: Contributed code]]

Formal Languages Block

2013-12-28T19:01:47Z

Oasychev: /* Installation */

Preg question type

2013-10-12T20:01:43Z

Oasychev: /* Authoring tools */ - updated to the RC state

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting, error reporting, regular expression testing (authoring tool) - Oleg Sychev.
# Regex parsing, NFA regex matching engine, matchers testing, backup&restore, unicode support, selection in regex text (in authoring tools) - Valeriy Streltsov.
# DFA regex matching engine (deprecated for now) - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Syntax tree (authoring tool) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time.
Thanks to:
* Joseph Rezeau - for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in Preg question type;
* Tim Hunt - for his polite and useful answers and comments that helped in writing this question type, and for joint work on the extra_question_fields and extra_answer_fields code, which is useful to many question type developers;
* Bondarenko Vitaly - for conversion of a vast set of regular expression matching tests.
You, too, could always [[#The ways to give back|help us]] a lot - regardless of the way you use Preg and your capabilities.

==Ways to use Preg questions and this docs==

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see if you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of the authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for the next correct word (lexem) or character (with a possible penalty)

===Settings affecting question work===
'''Case sensitivity''' sets the case sensitivity for all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You can still set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
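
The effect of the '''Exact matching''' setting can be imitated with the plain PHP preg functions. This is only a minimal sketch: the helper name '''response_matches''' and the way the anchors are added are assumptions made for illustration, not the actual Preg code.

<code php>
// Hypothetical helper, not part of the Preg plugin.
function response_matches($regex, $response, $exact) {
    // Exact matching: the whole response must match, so anchor both ends.
    // The non-capturing group keeps top-level alternatives inside the anchors.
    $pattern = $exact ? '^(?:' . $regex . ')$' : $regex;
    // The '~' delimiter assumes the regex itself contains no '~'.
    return preg_match('~' . $pattern . '~u', $response) === 1;
}

var_dump(response_matches('whole', 'the whole string', false)); // bool(true)
var_dump(response_matches('whole', 'the whole string', true));  // bool(false)
var_dump(response_matches('whole', 'whole', true));             // bool(true)
</code>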

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. Precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not inside a character class (use \s instead), so that you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from an unescaped '#' character outside a character class until the end of the line is treated as a comment.
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to the hinting features. You can skip everything that is said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
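
Since the extended notation is described as similar to the PHP 'x' modifier, that modifier can be used to get a feeling for it. The sketch below uses only the standard '''preg_match''' function; whether Preg treats every detail exactly like the 'x' modifier is an assumption, and the regex itself is just an illustrative one.

<code php>
// The same regex written compactly and in extended form ('x' modifier).
$compact = '/^[+\-]?[0-9]+(?:\.[0-9]+)?$/';
$extended = '/^
    [+\-]?            # optional sign
    [0-9]+            # integral part
    (?: \. [0-9]+ )?  # optional fractional part
$/x';

// Both forms give the same result for every response.
foreach (array('42', '-3.14', '3.') as $response) {
    $a = preg_match($compact, $response);
    $b = preg_match($extended, $response);
    echo $response . ': ' . $a . ' ' . $b . PHP_EOL;
}
</code>

The first two responses match both forms, the last one matches neither.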

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless PHP developers are persuaded to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts): it allows almost anything the DFA engine does, but is much better tested and more stable.

===Hinting===
Hinting is supported by NFA and DFA engines in adaptive and interactive behaviours.

====Partial matching====
Hinting starts with '''partial matching'''. By a partially correct response we mean a string that starts with correct characters (matching your regex), but at some character the match breaks. Assume you entered the regex
"'''are blue, white(,| and) red'''"
and a student answered
"they are blue, vhite and red"
In this situation the partial match is
"are blue, "
Note that the regex is unanchored ("Exact match" is set to "No") so the match may not start with the first character of the student's response (like in the example above: "they " is skipped). While using just partial matching the student will see the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type), showing it separately instead for a number of reasons:
# It is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more characters).
# It slightly encourages thinking about a hint, since when the response is modified automatically it is too easy to repeatedly press '''hint''', which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: "," or " " (leading to the " and" path). The question will choose "," because it leads to the shortest path to complete the match, while " " leads to the path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot the other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they have a full match with the student response.
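
The selection rule can be sketched in a few lines of PHP. The '''$answers''' array (regex => grade) and the variable names are hypothetical and only illustrate the rule; they are not the question type's real data structures.

<code php>
// Hypothetical answer list: regex => grade (fraction from 0 to 1).
$answers = array(
    'are blue, white(,| and) red' => 1.0,
    'are white(,| and) red'       => 0.6,
);
$hintgradeborder = 1.0;

// Keep only the answers whose grade reaches the hint grade border.
$usedforhinting = array_filter($answers, function ($fraction) use ($hintgradeborder) {
    return $fraction >= $hintgradeborder;
});

print_r(array_keys($usedforhinting));
</code>

With the border set to 1 only the first regex is kept for hinting; with a border of 0.5 both would be kept.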

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it shows the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For a natural language a ''word'', a ''number'', a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexems. For a programming language it can be a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for formatting strings in the C/C++ programming language; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't the word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax may become complex and prone to errors.
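
The substitution described above can be imitated with the standard '''preg_match''' and '''preg_replace_callback''' functions. This is a sketch under assumptions: the helper '''fill_placeholders''' is hypothetical and is not how the question type itself is implemented.

<code php>
// Hypothetical helper: fills {$0}, {$1}, ... in a feedback string.
function fill_placeholders($regex, $response, $feedback) {
    if (!preg_match('~' . $regex . '~', $response, $matches)) {
        return $feedback; // No match: leave the feedback untouched.
    }
    return preg_replace_callback('~\{\$(\d+)\}~', function ($m) use ($matches) {
        $index = (int)$m[1];
        // Subpatterns that took no part in the match become empty strings.
        return isset($matches[$index]) ? $matches[$index] : '';
    }, $feedback);
}

$regex = '[+\-]?([0-9]+)?\.([0-9]+)';
$feedback = 'The number is: {$0} Integral part is {$1} and fractional part is {$2}';
echo fill_placeholders($regex, '123.34', $feedback) . PHP_EOL;
echo fill_placeholders($regex, '.5', $feedback) . PHP_EOL;
</code>

The second call shows the empty integral part: the text "Integral part is" stays in place and is followed by nothing.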

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regex for the thing that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
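
The behaviour of this expression, and the reason for the '''^''' anchor, can be checked with the standard '''preg_match''' function. The responses below are made up for illustration.

<code php>
$missing = '~^(?!.*\bnecessary\b.*)~';

// Matches (returns 1) only when the word is absent from the response.
var_dump(preg_match($missing, 'it is a necessary step')); // int(0)
var_dump(preg_match($missing, 'it is a step'));           // int(1)

// Without "^" the match can start after the word, so the check is useless.
var_dump(preg_match('~(?!.*\bnecessary\b.*)~', 'it is a necessary step')); // int(1)
</code>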

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
 (?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the negative lookbehind "(?<!I\s)") nor "victor" after it (the negative lookahead "(?!\s+victor)"). A lookbehind cannot contain an unbounded repetition such as "\s+", so only a single space character is checked before the word, while "\s+" in the lookahead allows any number of spaces between words. If you want to catch the first (last) word (punctuation mark, etc.), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
 (?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not at the start of the string and is not followed by "am".

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You actually don't want to generate hints for these answers, as they don't define a correct situation, so it's not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regex

===Installation note and known technical issues===
To have the ''syntax tree'' and ''explaining graph'' tools working you (or your site admin) have to install Graphviz[http://www.graphviz.org/Graphviz] on the server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. Graphviz is used to draw the pictures for you.

Syntax tree and explaining graph may not work correctly in old Opera versions - for some reason the images are not updated on user actions. Fortunately, there's a newer version 16 for Windows which works with authoring tools pretty well. On Linux you will have to use something else.

===Regular expression area===
Here you can edit your regular expression. Clicking on "Show" sends the regex to all tools - syntax tree, explaining graph, description and testing results will be updated. "Save" closes the authoring tools form and saves the regex and test strings in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there.

You can select a part of the regular expression there, and the corresponding parts of the syntax tree, explaining graph, description and matched part of the strings will be highlighted. It is possible to select a part of the regex text that doesn't correspond to a logically complete part of the regular expression. In that case your selection will be widened to the nearest logically complete part.

===Matching options===
There you can change the options affecting your matching - matching engine, regex notation, exact matching, and case sensitivity.
* '''Matching engine''' will change the code performing the matching - you could use the testing tool to see if it suits your needs.
* '''Regular expression notation''' will change the way regexes are written - all tools will show you how this notation is interpreted.
* '''Case sensitivity''' will affect the basic case sensitivity of the expression; you can see the results in the graph - case insensitive nodes are gray, case sensitive ones are white.
* '''Exact matching''' will add new parts to your regex to ensure that the entire student's response matches it. These added parts will be shown on a gray background in the tools - see the picture below.
[[Image:qtype preg authortools9.png|exact matching]]

===Syntax tree===
Regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.

[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown in a green rectangle. You can also select nodes of the tree by clicking on them.

[[Image:qtype preg authortools3.jpg|part of the tree is selected]]

The tree will show you the names and numbers of all subexpressions (subpatterns), so you can check their numbering and the back references to them.

[[Image:qtype preg authortools8.png|numbered and named subexpressions in tree]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You could insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

A green rectangle shows you the selected part of the expression.
[[Image:qtype preg authortools11.png|selection in the tree and graph ]]

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work. The selected part of the expression will be shown with a yellow background colour.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings, showing which parts of your strings matched the expression, so you can test if it works as you expected. You will also see green check marks for the strings that match entire regular expressions (and will be graded for that regex) and red crosses for the strings that don't give full match. PHP preg matcher can't show partial matches, so it only shows full matches or nothing (to not mislead you that entire string is wrong).

If you selected a part of the regex, you will be able to see what part of each string matches that part (usually in yellow, but that may depend on your theme). The NFA matcher will show that for any part of the regex, the PHP preg matcher only for capturing subexpressions (subpatterns).

The strings for testing will be saved in database, if you save regex (they will be lost if you close window with "cancel" button) and (later) question.

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all: just write some characters in sequence, and they'll be concatenated. But it still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. Alternative is written as a vertical bar. There are many forms of quantifiers: the most commonly used are the question mark (repeat zero or one time), the asterisk (zero or more times) and the plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces; this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.
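
The difference between an operator character and the same character used as an operand can be seen in a short PHP sketch. It uses the standard '''preg_quote''' function, which adds the backslashes for you in your own PHP code; the sample string is made up for illustration.

<code php>
// The literal text a student should type; it contains operator characters.
$literal = 'a*b[1]';

// Unescaped, "*" is a quantifier and "[1]" is a character class.
var_dump(preg_match('~^a*b[1]$~', $literal)); // int(0)

// preg_quote() puts a backslash before every special character.
$escaped = preg_quote($literal, '~');
var_dump(preg_match('~^' . $escaped . '$~', $literal)); // int(1)
</code>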

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You could find a list and more examples of operands and operators in [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "manytime" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that space is in alternative to not require it before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more times in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
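
The examples above can be verified with the standard '''preg_match''' function. The helper '''whole_match''' is hypothetical; it only wraps the regex so that the entire string has to match.

<code php>
// Hypothetical helper: does the whole string match the regex?
function whole_match($regex, $string) {
    return preg_match('~^(?:' . $regex . ')$~', $string) === 1;
}

var_dump(whole_match('many times*', 'many timesss'));           // bool(true)
var_dump(whole_match('many times*', 'many timesmany times'));   // bool(false)
var_dump(whole_match('(many times)*', 'many timesmany times')); // bool(true)
var_dump(whole_match('first|second*', 'secondddd'));            // bool(true)
var_dump(whole_match('first|second*', 'firstsecond'));          // bool(false)
var_dump(whole_match('(first|second)*', 'firstsecondfirst'));   // bool(true)
var_dump(whole_match('(1|2){2}', '12'));                        // bool(true)
var_dump(whole_match('1{2}|2{2}', '12'));                       // bool(false)
</code>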

The internal structure of a regular expression can be seen well on the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower on the tree (or to the right in the horizontal view), and the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear enough to you. Remember that "executing" a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^start|end$" will match "start" from the start of the string or "end" at the end of it;
* "^(start|end)$" uses parentheses to match exactly "start" or "end";
* "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).
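
The three anchoring variants can be compared with the standard '''preg_match''' function. The sample responses are made up for illustration.

<code php>
$responses = array('start', 'start here', 'the end', 'in the end it starts');

foreach (array('~^start|end$~', '~^(start|end)$~', '~^start$|^end$~') as $regex) {
    foreach ($responses as $response) {
        echo $regex . ' on "' . $response . '": ' . preg_match($regex, $response) . PHP_EOL;
    }
}
</code>

The first regex matches the first three responses, because each alternative is anchored on one side only. The other two match only "start".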

If you set the '''exact matching''' options to "yes" (which is the default value), the question will add ^ and $ in each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains the ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't come first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches where there is no word boundary, the negation of "\b".
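
A few of the operands above can be tried out in a short sketch. It uses Python's standard <code>re</code> module, which is an assumption made for illustration: Python does not support "\Q ... \E", so <code>re.escape</code> is used as a rough equivalent, and Unicode properties and POSIX classes are left out because Python's <code>re</code> lacks them.

```python
import re

# re.escape backslash-escapes special characters, playing a role similar
# to wrapping the text in \Q ... \E in Perl-compatible regexes.
literal = re.escape("^(abc)$")
quoted = re.compile(literal + ".")   # the literal text, then any character

# Ranges a-s, C-F, 0-9 plus the single character "z".
char_class = re.compile(r"[a-szC-F0-9]")

# Negative set: anything except a..z and the hyphen itself.
negative_class = re.compile(r"[^a-z-]")

# \b only matches at a word boundary, so "cat" inside a longer word fails.
whole_word = re.compile(r"\bcat\b")

print(quoted.fullmatch("^(abc)$!") is not None)
print([c for c in "sztDG7" if char_class.fullmatch(c)])
print([c for c in "q-A!" if negative_class.fullmatch(c)])
print(whole_word.search("a cat sat") is not None,
      whole_word.search("concatenate") is not None)
```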

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
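
The operator examples above can be verified with a small sketch, again using Python's standard <code>re</code> module as a stand-in for the Preg engines. The helper name <code>whole</code> is hypothetical; it requires the entire string to match.

```python
import re


def whole(regex, text):
    """Return True if the entire text matches the regex."""
    return re.fullmatch(regex, text) is not None


# Alternation with an empty alternative.
print(whole(r"(aa|bb|)c", "aac"), whole(r"(aa|bb|)c", "c"))

# Quantifier bounds.
print(whole(r"x{2,4}", "xxx"), whole(r"x{2,4}", "xxxxx"))

# A quantifier applies to one operand only; parentheses widen its scope.
print(whole(r"ab*", "ababab"), whole(r"(ab)*", "ababab"))

# Repeated alternative: each repetition chooses "a" or "b" independently.
print([s for s in ["aa", "ab", "ba", "bb", "abc"] if whole(r"(a|b){2}", s)])
```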

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a small example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
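
The declaration example can be tried in a short sketch. It uses Python's standard <code>re</code> module (an assumption for illustration; numbered backreferences work the same way there), and the helper name <code>declared_variable</code> is hypothetical.

```python
import re

# The regex from the example above: the variable name captured by
# subpattern 1 must be repeated by the backreference \1.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)


def declared_variable(response):
    """Return the variable name if the whole response matches, else None."""
    match = DECLARATION.fullmatch(response)
    return match.group(1) if match else None


print(declared_variable(" int var2 ; var2=123 ; "))
print(declared_variable("int _var; _var = -10;"))
print(declared_variable("int a; b = 1;"))  # different names: no match
```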

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
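
A minimal sketch of named subpatterns follows. It uses Python's standard <code>re</code> module, which is an assumption made for illustration: of the syntaxes listed above it is only certain to accept the "(?P<name>...)" and "(?P=name)" forms, so only those are shown. The group name <code>letter</code> is hypothetical.

```python
import re

# Named version of "([ab])\1": the group is called "letter" and the
# backreference refers to it by name instead of by number.
DOUBLED = re.compile(r"(?P<letter>[ab])(?P=letter)")

match = DOUBLED.fullmatch("aa")
print(match.group("letter"))   # the named group ...
print(match.group(1))          # ... is still available by its number
print(DOUBLED.fullmatch("ab")) # None: the backreference must repeat "a"
```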

====Duplicate subpattern numbers and names====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

===Complex assertions===
Complex assertions test some part of the string without including it in the matched text, but they affect whether and where a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
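
The four assertions can be compared in a small sketch using Python's standard <code>re</code> module (a stand-in for illustration; these lookaround forms behave the same way as in Perl-compatible regexes). The helper name <code>first_match</code> is hypothetical.

```python
import re


def first_match(regex, text):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(regex, text)
    return match.group(0) if match else None


# Positive lookahead: "b" must follow but is not part of the match.
print(first_match(r"a+(?=b)", "aaab"))
# Negative lookahead: the engine gives back one "a" so that the next
# character after the match is "a" rather than "b".
print(first_match(r"a+(?!b)", "aaab"))
# Positive lookbehind: "b" must precede the match.
print(first_match(r"(?<=b)a+", "baaa"))
# Negative lookbehind: the match cannot start right after "b".
print(first_match(r"(?<!b)a+", "baaa"))
```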

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This affects the general case-sensitivity, which is chosen on the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question as "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
The native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This project is free software, so it's hard to get any feedback. You shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or without giving the authors some encouragement or simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

This software is considered a scientific project and such things could be really useful and appreciated:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and were used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you use Preg and run a survey among your teachers and/or students about it - but it can include co-authoring a conference thesis or a journal article;
* cooperation in writing an article, or help with publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, when doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and not much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question, which I could reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that'll help, and I would be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-10-11T17:21:39Z

Oasychev: /* Authoring tools */

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting, error reporting, regular expression testing (authoring tool) - Oleg Sychev.
# Regex parsing, NFA regex matching engine, matchers testing, backup&restore, unicode support, selection in regex text (in authoring tools) - Valeriy Streltsov.
# DFA regex matching engine (deprecated for now) - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Syntax tree (authoring tool) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time.
Thanks to:
* Joseph Rezeau - for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type;
* Tim Hunt - for his polite and useful answers and commentaries that helped writing this question, also for joint work on extra_question_fields and extra_answer_fields code, that is useful to many question type developers;
* Bondarenko Vitaly - for conversion of a vast set of regular expression matching tests.
You, too, could always [[#The ways to give back|help us]] a lot - regardless of the way you use Preg and your capabilities.

==Ways to use Preg questions and this docs==

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see if you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible students answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
'''Case sensitivity''' sets the case sensitivity for all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not inside character classes (use \s instead) - so that you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything after a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to the hinting features. You can skip all that is said about regexes there, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
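
The idea behind the extended notation can be sketched with Python's <code>re.VERBOSE</code> flag, which is assumed here to be a close analogue of the PHP 'x' modifier (it is not the Preg implementation). The decimal-number regex is borrowed from the subpattern capturing example in these docs; the names <code>EXTENDED</code> and <code>COMPACT</code> are hypothetical.

```python
import re

# Extended notation: unescaped whitespace is ignored and everything
# after "#" up to the end of the line is a comment.
EXTENDED = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
""", re.VERBOSE)

# The same regex in the usual one-line notation.
COMPACT = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

for response in ["123.34", "-.5", "12."]:
    print(response,
          EXTENDED.fullmatch(response) is not None,
          COMPACT.fullmatch(response) is not None)
```

Both notations accept and reject the same responses; the extended one is only easier to read and maintain.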

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but it may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts): it allows almost anything the DFA engine does, but the NFA engine is much better tested and more stable.

===Hinting===
Hinting is supported by NFA and DFA engines in adaptive and interactive behaviours.

====Partial matching====
Hinting starts with '''partial matching'''. By partially correct response we understand a string that starts with correct characters (matching your regex) but on some character the match breaks. Assume you entered the regex
"'''are blue, white(,| and) red'''"
and a student answered
"they are blue, vhite and red"
In this situation the partial match is
"are blue, "
Note that the regex is unanchored ("Exact match" is set to "No") so the match may not start with the first character of the student's response (like in the example above: "they " is skipped). While using just partial matching the student will see the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type), showing it separately instead for a number of reasons:
# It is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more).
# It slightly facilitates thinking about a hint, since when the response is modified it is too easy to repeatedly press '''hint''', which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: "," or " " (leading to the " and" path). The question will choose "," because it leads to the shortest path to complete the match, while " " leads to the path 3 characters longer.
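
Partial matching and the shortest-path rule can be illustrated with a deliberately naive sketch. It is not how the NFA engine works: instead of an automaton, it assumes the two strings accepted by the example regex have been enumerated by hand in <code>ACCEPTED</code>. All names here are hypothetical.

```python
# Strings accepted by "are blue, white(,| and) red", enumerated by hand.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def partial_match(response, accepted=ACCEPTED):
    """Return (start, length) of the longest partial match (unanchored)."""
    best = (0, 0)
    for start in range(len(response)):
        for target in accepted:
            n = common_prefix_len(response[start:], target)
            if n > best[1]:
                best = (start, n)
    return best


def next_char_hint(response, accepted=ACCEPTED):
    """Hint the character on the shortest way to a complete match."""
    start, length = partial_match(response, accepted)
    matched = response[start:start + length]
    candidates = [t for t in accepted
                  if t.startswith(matched) and len(t) > length]
    if not candidates:
        return None
    return min(candidates, key=len)[length]


start, length = partial_match("they are blue, vhite and red")
print(repr("they are blue, vhite and red"[start:start + length]))
print(repr(next_char_hint("are blue, white; red")))
```

For the two responses discussed above, this reproduces the partial match "are blue, " and the hint ",".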

It is possible that not all regular expressions will give 100% grade. Consider you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot the other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
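
The selection rule can be written down as a tiny sketch. The data layout (a list of regex and grade pairs, with grades as fractions of 1) and the function name are assumptions made for illustration only.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


# The two answers from the example above: full grade and 60% grade.
answers = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

print(answers_used_for_hinting(answers, 1.0))  # only the 100% answer
print(answers_used_for_hinting(answers, 0.5))  # both answers
```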

====Next character hinting====
When next character hinting is available, the student will see the '''hint next character''' button; pressing it reveals the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it can be a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for formatting strings in the C/C++ programming language; you will probably have it disabled.

Administrator of the site can control what languages are available to the teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't the word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.
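
These capturing rules can be observed in a short sketch using Python's standard <code>re</code> module (a stand-in for illustration; the rules shown are the same in Perl-compatible regexes). The variable names are hypothetical.

```python
import re

# A repeated subpattern keeps only the last repetition.
repeated = re.fullmatch(r"(ab|cd)+", "abcd")
print(repeated.group(1))

# (?: ) groups without capturing, so "ef" becomes subpattern 1.
non_capturing = re.fullmatch(r"(?:ab|cd)+(ef)", "abcdef")
print(non_capturing.groups())

# Nested subpatterns are numbered by their opening parentheses.
nested = re.fullmatch(r"a(b(c)(d))e", "abcde")
print(nested.group(0), nested.groups())
```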

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
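
The placeholder substitution described above can be imitated in a short sketch. This is not the Preg implementation: it uses Python's standard <code>re</code> module, and the function name <code>fill_feedback</code> is hypothetical.

```python
import re


def fill_feedback(regex, response, feedback):
    """Replace {$N} placeholders in feedback with captured subpatterns."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)   # unknown subpattern: leave as is
        return match.group(index) or ""   # unmatched subpattern: empty

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


REGEX = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = ("The number is: {$0} Integral part is {$1} "
            "and fractional part is {$2}")

print(fill_feedback(REGEX, "123.34", FEEDBACK))
print(fill_feedback(REGEX, ".5", FEEDBACK))
```

With the response ".5" the integral part is empty, so the text "Integral part is" is followed by nothing, as noted above.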

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature that lets you define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even though the word is present).

If this description is difficult for you, just surround the regex describing what must be missing (here '''.*\bnecessary\b.*''') with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
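
A minimal Python sketch of this check follows. It relies on Python's ''re'' module, which handles this particular pattern the same way; the function name ''word_is_missing'' is made up for the example.

```python
import re

# Anchored negative lookahead: it succeeds (with an empty match at position 0)
# only when the word "necessary" does not occur anywhere in the response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """True if the response does not contain the word "necessary"."""
    return MISSING_NECESSARY.search(response) is not None


print(word_is_missing("it is optional"))          # the word is absent
print(word_is_missing("it is necessary to do"))   # the word is present
print(word_is_missing("an unnecessary step"))     # only part of another word
```

The last call illustrates the role of "\b": "unnecessary" does not count as the word "necessary", so the answer for the missing word still fires.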

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s+)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words. If you want to catch the first (last) word (punctuation mark, etc.), you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
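
The idea can be tried out in Python with one adjustment: Python's ''re'' module accepts only fixed-width lookbehind assertions, so this sketch assumes exactly one whitespace character between the words and writes the lookbehind as "(?<!I\s)". The function name ''am_is_misplaced'' is hypothetical.

```python
import re

# "am" that is neither preceded by "I" plus one whitespace character
# nor followed by whitespace and "victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")


def am_is_misplaced(response):
    """True if the response contains an "am" outside of "I am victor"."""
    return MISPLACED_AM.search(response) is not None


print(am_is_misplaced("I am victor"))   # correct word order
print(am_is_misplaced("am I victor"))   # "am" moved to the front
print(am_is_misplaced("I victor am"))   # "am" moved to the end
```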

Note that if you have several answers to catch missing and misplaced things, only one of them will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regex

===Installation note and known technical issues===
To have the ''syntax tree'' and ''explaining graph'' tools working, you (or your site admin) have to install Graphviz[http://www.graphviz.org/Graphviz] on the server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. Graphviz is used to draw the pictures.

Syntax tree and explaining graph may not work correctly in old Opera versions - for some reason the images are not updated on user actions. Fortunately, there's a newer version 16 for Windows which works with authoring tools pretty well. On Linux you will have to use something else.

===Regular expression area===
Here you can edit your regular expression. Clicking on "Show" sends the regex to all tools - syntax tree, explaining graph, description and testing results will be updated. "Save" closes the authoring tools form and saves the regex and test strings in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there.

You can select a part of the regular expression there, and the corresponding parts of the syntax tree, explaining graph and description, as well as the matched part of the strings, will be highlighted. It is possible to select a part of the regex text that doesn't correspond to a logically complete part of the regular expression. In that case your selection will be widened to the nearest logically complete part.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. It is most useful if you know how to read regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand operators and the concept of precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown in a green rectangle. You can also select nodes of the tree by clicking on them.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

The tree will show you the names and numbers of all subexpressions (subpatterns), so you can check their numbering and the backreferences to them.
TODO - picture.

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

A green rectangle shows you the selected part of the expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work. The selected part of the expression is shown with a yellow background.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected. You will also see green check marks for the strings that match the entire regular expression (and will be graded for that regex) and red crosses for the strings that don't give a full match. The PHP preg matcher can't show partial matches, so it shows only full matches or nothing (so as not to mislead you into thinking that the entire string is wrong).

If you selected a part of the regex, you will be able to see which part of each string matches that part (usually in yellow, but that may depend on your theme). The NFA matcher will show this for any part of the regex, the PHP preg matcher only for capturing subexpressions (subpatterns).

The strings for testing will be saved in the database if you save the regex and (later) the question; they will be lost if you close the window with the "Cancel" button.

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions, like any '''expressions''', are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have been learning to master arithmetic expressions since childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each (arity), controlling their evaluation order through '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. Their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches the single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all - just write some characters in sequence, and they'll be concatenated. But it still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. The alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word boundary '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "manytime" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that space is in alternative to not require it before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more times in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
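
The precedence rules listed above can be checked with a few lines of Python. This is only a sketch using Python's ''re'' module (whose precedence rules for these operators are the same); the helper ''matches'' is a made-up name and mimics exact matching by requiring the whole string to match.

```python
import re


def matches(pattern, text):
    """True if the whole text matches the pattern (as with exact matching)."""
    return re.fullmatch(pattern, text) is not None


# Quantifier over concatenation: "*" applies to the last "s" only.
print(matches("many times*", "many timesss"))
print(matches("many times*", "many timesmany times"))
print(matches("(many times)*", "many timesmany times"))

# Quantifier over alternative: "*" applies to the last "d" only.
print(matches("first|second*", "secondddd"))
print(matches("(first|second)*", "firstsecondfirstfirst"))

# The whole alternative is repeated, not one fixed choice from it.
print(matches("(1|2){2}", "12"))
print(matches("1{2}|2{2}", "12"))
```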

The internal structure of a regular expression can be seen clearly in the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower in the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear enough to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^start|end$" will match "start" at the start of the string or "end" at the end of it;
* "^(start|end)$" uses parentheses to match exactly "start" or "end";
* "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).
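
The three anchoring variants above behave as follows in a small Python sketch. It uses ''re.search'', which looks for a match anywhere in the string, to imitate a question with exact matching switched off; the helper name ''found'' is an assumption of this example.

```python
import re


def found(pattern, response):
    """True if the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None


# Each anchor belongs to one alternative only.
print(found("^start|end$", "start of something"))
print(found("^start|end$", "the end"))

# Parentheses make both anchors apply to both alternatives.
print(found("^(start|end)$", "start of something"))
print(found("^(start|end)$", "end"))

# Anchoring every top-level alternative gives the same exact match.
print(found("^start$|^end$", "the end"))
```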

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anyway); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except those from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters, so it stands for itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set, it shouldn't be the first character there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (a complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches where there is no word boundary, the negation of "\b".
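
Some of the operands from the list above can be tried in Python. The sketch below covers only escaping, character classes and the word boundary, because Python's ''re'' module does not support "\Q ... \E", "\p{xx}" or POSIX classes; the helper names ''one_char'' and ''found'' are made up for the example.

```python
import re


def one_char(char_class, ch):
    """True if the single character ch is matched by the character class."""
    return re.fullmatch(char_class, ch) is not None


def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None


# Escaped special characters match themselves.
print(re.fullmatch(r"a\*b\[", "a*b[") is not None)

# Ranges "a-s", "C-F", "0-9" plus the single character "z".
print([ch for ch in "ktzD7" if one_char(r"[a-szC-F0-9]", ch)])

# Negative character set: anything except "a".."z" and "-".
print([ch for ch in "Aq-" if one_char(r"[^a-z-]", ch)])

# Escaping inside a character set: "-", "]" and "\".
print([ch for ch in "-]\\a" if one_char(r"[\-\]\\]", ch)])

# Word boundaries do not consume characters.
print(found(r"\bcat\b", "a cat sat"), found(r"\bcat\b", "concatenate"))
```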

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of something used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
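
Both backreference examples above can be run as written in Python, since the "\1" syntax is the same in its ''re'' module. This is a sketch for experimenting, not a description of how Preg evaluates them; the constant names are made up.

```python
import re

# The backreference must repeat exactly what the 1st subpattern captured.
DOUBLED = re.compile(r"([ab])\1")

# Declaration and initialization of the same integer variable,
# with optional spaces, as in the second regex above.
DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

for text in ["aa", "bb", "ab", "ba"]:
    print(text, DOUBLED.fullmatch(text) is not None)

for text in [" int var2 ; var2=123 ; ", "int _var; _var = -10;", "int a; b = 1;"]:
    match = DECLARATION.fullmatch(text)
    # group(1) holds the variable name captured by the 1st subpattern.
    print(repr(text), match.group(1) if match else None)
```

The last string is rejected because the name after the semicolon ("b") differs from the captured name ("a").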

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
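
The benefit of names can be seen in a short Python sketch. Note the assumption: of the syntaxes listed above, Python's ''re'' module understands only "(?P<name>...)" and "(?P=name)", so only those two are used here; the name ''var'' and the "unsigned" prefix are invented for the example.

```python
import re

# Named subpattern "var" and a backreference to it by name.
BY_NAME = re.compile(
    r"int\s+(?P<var>[_\w][_\w\d]*)\s*;\s*(?P=var)\s*=\s*-?\d+\s*;"
)

# The same regex after a new subpattern was added in front:
# the variable is now subpattern number 2, but its name is unchanged.
EXTENDED = re.compile(
    r"(unsigned\s+)?int\s+(?P<var>[_\w][_\w\d]*)\s*;\s*(?P=var)\s*=\s*-?\d+\s*;"
)

print(BY_NAME.fullmatch("int x; x = 5;").group("var"))

match = EXTENDED.fullmatch("unsigned int x; x = 5;")
print(match.group("var"), match.group(2))
```

A numbered backreference "\1" in the first regex would have had to be changed to "\2" in the second one, while the named backreference keeps working unchanged.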

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will use the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both the 1st, "b" and "d" are both the 2nd.

===Complex assertions===
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
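
The four assertions above use the same syntax in Python's ''re'' module, so their effect on the matched text can be observed with a small sketch (the helper ''first_match'' is a made-up name):

```python
import re


def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Lookahead: "b" is checked but never becomes part of the match.
print(first_match(r"a+(?=b)", "aaab"))   # -> aaa
# Negative lookahead: the engine gives back one "a" so that "a" follows, not "b".
print(first_match(r"a+(?!b)", "aaab"))   # -> aa
# Lookbehind: "b" must stand right before the match.
print(first_match(r"(?<=b)a+", "baa"))   # -> aa
# Negative lookbehind: the match cannot start right after "b".
print(first_match(r"(?<!b)a+", "baa"))   # -> a
```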

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
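
The scope of the two examples above can be imitated in Python, with one difference that should be kept in mind: recent versions of Python's ''re'' module reject a bare "(?i)" in the middle of a pattern, so this sketch marks the same case-insensitive parts with the scoped form "(?i:...)" instead. The constant and helper names are invented.

```python
import re


def exact(pattern, text):
    """True if the whole text matches the (case-sensitive by default) pattern."""
    return re.fullmatch(pattern, text) is not None


# Only "gh" is case-insensitive, as in the first example above.
INNER = r"abc(de(?i:gh))xyz"
# Everything from "gh" to the end is case-insensitive, as in the second example.
TAIL = r"abc(de)(?i:ghxyz)"

print(exact(INNER, "abcdeGHxyz"), exact(INNER, "abcDEghxyz"))
print(exact(TAIL, "abcdeGHXYZ"), exact(TAIL, "ABCdeghxyz"))
```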

===Error reporting===
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a non-existent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This project is free software, so it's hard to get any feedback. You shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or without giving the authors some encouragement or simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

This software is considered a scientific project and such things could be really useful and appreciated:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you use Preg and run a survey among your teachers and/or students about it - but it can include co-authoring a conference thesis or a journal article;
* cooperation in writing an article or help with publishing it in English-language journals (information and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, when doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, though, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper that describes your usage of the Preg question type, and that I could cite, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me which organisation you belong to and how you use Preg - that will help, and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve the set of authoring tools to make writing regular expressions easier
* Add more languages for next lexeme hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

File:qtype preg authortools4.png

2013-10-11T17:09:45Z

Oasychev: Oasychev uploaded a new version of "File:qtype preg authortools4.png": Update to new version of tools.

File:qtype preg authortools2.png

2013-10-11T17:00:56Z

Oasychev: Oasychev uploaded a new version of "File:qtype preg authortools2.png": Updated to the new state of authoring tools.

Preg question type

2013-10-05T22:49:06Z

Oasychev: /* Regular expression area */ - updated selection information

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using this documentation; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting, error reporting, regular expression testing (authoring tool) - Oleg Sychev.
# Regex parsing, NFA regex matching engine, matchers testing, backup&restore, unicode support, selection in regex text (in authoring tools) - Valeriy Streltsov.
# DFA regex matching engine (deprecated for now) - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Syntax tree (authoring tool) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time.
Thanks to:
* Joseph Rezeau - for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type;
* Tim Hunt - for his polite and useful answers and commentaries that helped writing this question, also for joint work on extra_question_fields and extra_answer_fields code, that is useful to many question type developers;
* Bondarenko Vitaly - for conversion of a vast set of regular expression matching tests.
You, too, could always [[#The ways to give back|help us]] a lot - regardless of the way you use Preg and your capabilities.

==Ways to use Preg questions and this docs==

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blunt trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see if you really understand them well and they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers
* '''Hinting''' - when students are stuck on the question, you may allow them to ask for the next correct word (lexem) or character (possibly with a penalty)

===Settings affecting question work===
'''Case sensitivity''' sets the case sensitivity for all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You can still set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
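
The difference between the two '''Exact matching''' values can be tried outside Moodle. The following is a minimal sketch in Python, not the Preg engine itself: it assumes that Python's <code>re.fullmatch</code> and <code>re.search</code> behave like ''Yes'' and ''No'' respectively, and the function name <code>is_correct</code> is hypothetical.

```python
import re

def is_correct(regex, response, exact_matching):
    """Return True if the response is accepted under the given setting."""
    if exact_matching:
        # "Yes": the entire response must match the regex.
        return re.fullmatch(regex, response) is not None
    # "No": it is enough that some part of the response matches.
    return re.search(regex, response) is not None

print(is_correct("whole", "the whole string", exact_matching=False))  # True
print(is_correct("whole", "the whole string", exact_matching=True))   # False
```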

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not inside character classes (use \s instead) - so that you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything after a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to the hinting features. You can skip all that is said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
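
To get a feeling for the extended notation, here is a small sketch in Python. It assumes that Python's <code>re.VERBOSE</code> flag is close enough to the PHP 'x' modifier for this purpose; the regex and the name <code>NUMBER_WITH_UNIT</code> are made up for illustration and are not taken from Preg.

```python
import re

# A decimal number followed by " cm", written in free-spacing style.
# Unescaped whitespace and "#" comments are ignored by the engine.
NUMBER_WITH_UNIT = re.compile(r"""
    [+\-]?       # optional sign
    [0-9]*       # optional integral part
    \.           # decimal point
    [0-9]+       # fractional part
    [ ]cm        # a space inside a character class is kept
""", re.VERBOSE)

print(NUMBER_WITH_UNIT.fullmatch("-12.5 cm") is not None)  # True
print(NUMBER_WITH_UNIT.fullmatch("-12.5cm") is not None)   # False
```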

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but it may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject regexes that this engine accepts); it allows almost anything the DFA engine does, and the NFA engine is much better tested and more stable.

===Hinting===
Hinting is supported by NFA and DFA engines in adaptive and interactive behaviours.

====Partial matching====
Hinting starts with '''partial matching'''. A partially correct response is a string that starts with correct characters (matching your regex), but at some character the match breaks. Assume you entered the regex
"'''are blue, white(,| and) red'''"
and a student answered
"they are blue, vhite and red"
In this situation the partial match is
"are blue, "
Note that the regex is unanchored ("Exact matching" is set to "No") so the match may not start with the first character of the student's response (as in the example above: "they " is skipped). When only partial matching is used, the student will see the correct and incorrect parts:
they are blue, vhite and red
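
The idea of a partial match can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not how the NFA engine works: the two strings accepted by the example regex are listed by hand, and the names <code>ACCEPTED</code>, <code>common_prefix_length</code> and <code>partial_match</code> are hypothetical.

```python
# The two strings accepted by "are blue, white(,| and) red",
# written out by hand for this sketch.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def common_prefix_length(a, b):
    """Count how many leading characters of a and b are equal."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def partial_match(response, accepted=ACCEPTED):
    """Return (start index, matched text) of the longest partial match."""
    best_start, best_length = 0, 0
    # The regex is unanchored, so the match may start anywhere.
    for start in range(len(response)):
        for target in accepted:
            length = common_prefix_length(response[start:], target)
            if length > best_length:
                best_start, best_length = start, length
    return best_start, response[best_start:best_start + best_length]

print(partial_match("they are blue, vhite and red"))  # (5, 'are blue, ')
```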

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type), showing it separately instead for a number of reasons:
# It is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more).
# It encourages thinking about the hint: if the response were modified automatically, it would be too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: "," or " " (leading to the " and" path). The question will choose "," because it leads to the shortest path to complete the match, while " " leads to a path 3 characters longer.
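
The shortest-path rule can be sketched as follows. Again this is a simplified Python illustration, assuming the accepted strings are enumerated by hand; <code>next_character_hint</code> is a hypothetical name, and the real engine works on the automaton rather than on a list of strings.

```python
# Strings accepted by "are blue, white(,| and) red", listed by hand.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def next_character_hint(matched, accepted=ACCEPTED):
    """Return the first character of the shortest completion, or None."""
    # A completion is what remains of an accepted string after the matched part.
    completions = [target[len(matched):] for target in accepted
                   if target.startswith(matched) and len(target) > len(matched)]
    if not completions:
        return None
    shortest = min(completions, key=len)
    return shortest[0]

# ", red" (5 characters) is shorter than " and red" (8 characters).
print(repr(next_character_hint("are blue, white")))  # ','
```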

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot the other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match with the student response.
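
As a rough sketch of this filtering in Python (the data layout and the name <code>answers_used_for_hinting</code> are assumptions made for the example, not the plugin's internals):

```python
# Each answer is a (regex, grade) pair; grades are fractions from 0 to 1.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(answers_used_for_hinting(ANSWERS, 1))    # only the 100% answer
print(answers_used_for_hinting(ANSWERS, 0.5))  # both answers
```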

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it reveals the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For a natural language, a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexems. For a programming language it can be a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has completed the current lexem), for example
are blue
or
are blue,
or
are blue, white
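
A very rough Python sketch of this behaviour is given below. It assumes a toy scanner where a lexem is either a run of word characters or a run of punctuation marks, and that the matched text is a prefix of one known correct answer; the names <code>LEXEM</code> and <code>next_lexem_hint</code> are hypothetical and the real scanners of the formal languages block are more elaborate.

```python
import re

# Toy scanner: a lexem is a run of word characters or a run of punctuation.
LEXEM = re.compile(r"\w+|[^\w\s]+")

def next_lexem_hint(matched, target):
    """Return the text that completes the current lexem or adds the next one.

    `matched` is assumed to be a prefix of `target`.
    """
    position = len(matched)
    for lexem in LEXEM.finditer(target):
        # The first lexem ending after the matched part is the one to hint.
        if lexem.end() > position:
            return target[position:lexem.end()]
    return ""

target = "are blue, white, red"
print(repr(next_lexem_hint("are bl", target)))     # 'ue'
print(repr(next_lexem_hint("are blue", target)))   # ','
print(repr(next_lexem_hint("are blue,", target)))  # ' white'
```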

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the programming language C (or C++);
; ''printf language'' : a special language for formatting strings in the C/C++ programming language; you will probably have it disabled.

Administrator of the site can control what languages are available to the teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't the word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching, the engine remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax would become complex and error-prone.
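
The placeholder substitution described above can be imitated in Python as follows. This is a sketch of the idea only: <code>fill_feedback</code> is a hypothetical helper, and Python's <code>re</code> module stands in for the Preg matching engines.

```python
import re

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

def fill_feedback(regex, feedback, response):
    """Replace {$n} placeholders with the captured subpatterns."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # No such subpattern in this regex: leave the text untouched.
            return placeholder.group(0)
        # A subpattern that took no part in the match gives an empty string.
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

print(fill_feedback(NUMBER, FEEDBACK, "123.34"))
```

With the response ".5" the same call leaves "Integral part is" followed by nothing, as described above.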

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary", where matching is possible even if the word is present).

If this description is difficult for you, just surround the regex for the thing that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
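
The missing-word regex can be tried with any Perl-compatible engine. Below is a small check in Python, whose <code>re</code> module supports the same lookahead syntax; the helper name <code>word_is_missing</code> is made up for this example.

```python
import re

# Matches (at the start) only if "necessary" does not occur as a whole word.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("It is necessary to water the plants"))  # False
print(word_is_missing("You should water the plants"))          # True
# "\b" makes sure that "unnecessary" does not count as "necessary".
print(word_is_missing("An unnecessary step"))                  # True
```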

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because lookbehind assertions generally must have a fixed length in PCRE. If you want to catch the first (last) word (punctuation mark, etc.), place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
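
Here is a quick Python check of the misplaced-word idea. Note the assumptions: the lookbehind is written in the standard form <code>(?<!...)</code> with a fixed-length body (a single <code>\s</code>), because Python's <code>re</code> module, used here as a stand-in for a Perl-compatible engine, rejects variable-length lookbehind; the constant names are hypothetical.

```python
import re

# "am" that is neither preceded by "I " nor followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by " am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

print(MISPLACED_AM.search("I am victor") is not None)  # False: "am" is in place
print(MISPLACED_AM.search("am I victor") is not None)  # True: "am" is misplaced
print(MISPLACED_I.search("I am victor") is not None)   # False: "I" is in place
print(MISPLACED_I.search("am I victor") is not None)   # True: "I" is misplaced
```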

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so it's not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regex

===Installation note and known technical issues===
To have the ''syntax tree'' and ''explaining graph'' tools working, you (or your site admin) have to install Graphviz[http://www.graphviz.org/Graphviz] on the server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. Graphviz is used to draw the pictures for you.

Syntax tree and explaining graph may not work correctly in old Opera versions - for some reason the images are not updated on user actions. Fortunately, there's a newer version 16 for Windows which works with authoring tools pretty well. On Linux you will have to use something else.

===Regular expression area===
Here you can edit your regular expression. Clicking on "Show" sends the regex to all tools - syntax tree, explaining graph, description and testing results will be updated. "Save" closes the authoring tools form and saves the regex and test strings in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there.

You can select a part of the regular expression there, and the corresponding parts of the syntax tree, explaining graph, description and the matched part of the strings will be highlighted. It is possible to select a part of the regex text that doesn't correspond to a logically complete part of the regular expression. In that case your selection will be widened to the nearest logically complete part.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of an expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You could insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tool tries to formulate a sentence describing how your expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected. You will also see green check marks for the strings that match the entire regular expression (and will be graded for that regex) and red crosses for the strings that don't give a full match. The PHP preg matcher can't show partial matches, so it only shows full matches or nothing (so as not to mislead you into thinking that the entire string is wrong).

If you selected a part of the regex, you will be able to see what part of the strings matches that part of the regex (usually in yellow, but that may depend on your theme). The NFA matcher will show that for any part of the regex, the PHP preg matcher only for the capturing subexpressions (subpatterns).

The strings for testing will be saved in database, if you save regex (they will be lost if you close window with "cancel" button) and (later) question.

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define a way to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all - just write some characters in sequence, and they'll be concatenated. But it still has precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. Alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), asterisk (zero or more times) and plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "many time" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that space is in alternative to not require it before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more times in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
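
The precedence rules above can be verified with a few assertions. The sketch below uses Python's <code>re.fullmatch</code> as a stand-in for a Perl-compatible engine with exact matching; <code>matches</code> is a hypothetical helper name.

```python
import re

def matches(regex, text):
    """True if the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: "*" repeats only the last "s".
assert matches("many times*", "many time")
assert matches("many times*", "many timesss")
assert not matches("many times*", "many timesmany times")
assert matches("(many times)*", "many timesmany times")

# Quantifier over alternative: "*" repeats only the last "d".
assert matches("first|second*", "secondddd")
assert not matches("first|second*", "firstsecond")
assert matches("(first|second)*", "firstsecondfirstfirst")

# The quantifier repeats the whole alternative, not one selection from it.
assert matches("(1|2){2}", "12")
assert not matches("1{2}|2{2}", "12")
assert matches("1{2}|2{2}", "22")
```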

The internal structure of a regular expression can be viewed well on the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower on the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^start|end$" will match "start" from the start of the string or "end" at the end of it;
* "^(start|end)$" uses parentheses to match exactly "start" or "end";
* "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).
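
These three anchoring variants can be compared with a short Python sketch (Python's <code>re.search</code> is assumed to behave like an unanchored Preg match here, and <code>found</code> is a hypothetical helper):

```python
import re

def found(regex, response):
    """True if the regex matches somewhere in the response."""
    return re.search(regex, response) is not None

# Each anchor belongs to one alternative only.
assert found("^start|end$", "start of something")
assert found("^start|end$", "the very end")

# Parentheses make both anchors apply to the whole alternative set.
assert not found("^(start|end)$", "start of something")
assert found("^(start|end)$", "end")

# Anchoring every top-level alternative has the same effect.
assert found("^start$|^end$", "start")
assert not found("^start$|^end$", "the very end")
```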

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ in each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anyway); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters which could be used both inside or outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' are not characters but conditions to test: they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
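The operands above can be tried out with a few lines of code. This is only a minimal sketch that assumes Python's built-in re module as a stand-in for the Perl-compatible dialect used by Preg; the two dialects agree on the constructs shown here, but not on everything in the list (Python's re has no "\Q...\E", Unicode property or POSIX class support, for instance).

```python
import re

# Character class with ranges: "a".."s", "z", "C".."F" and "0".."9".
ranges = re.compile(r"[a-szC-F0-9]")
# Negative character set: anything except "a".."z" and "-".
negated = re.compile(r"[^a-z-]")

print(bool(ranges.fullmatch("z")))   # "z" is listed as a single character
print(bool(ranges.fullmatch("t")))   # "t" lies outside the "a-s" range
print(bool(negated.fullmatch("-")))  # the trailing hyphen is excluded too

# Simple assertions consume no characters: "\b" only tests a position.
words = re.findall(r"\bcat\b", "cat concat cat.")
print(words)

# An escaped dot matches only a literal dot.
dot_literal = re.fullmatch(r"a\.b", "a.b") is not None
dot_other = re.fullmatch(r"a\.b", "axb") is not None
print(dot_literal, dot_other)
```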

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
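Operator precedence is easiest to see by comparing patterns that differ only in parentheses. The sketch below again assumes Python's re module as a stand-in for the Perl-compatible dialect; the helper name accepts is made up for this illustration.

```python
import re

def accepts(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifiers bind tighter than concatenation:
# "ab+" repeats only "b", while "(ab)+" repeats the group "ab".
print(accepts(r"ab+", "abbb"), accepts(r"ab+", "abab"))
print(accepts(r"(ab)+", "abab"), accepts(r"(ab)+", "abbb"))

# Concatenation binds tighter than alternation:
# "ab|cd" means "ab" or "cd", not "a", then "b" or "c", then "d".
print(accepts(r"ab|cd", "cd"), accepts(r"ab|cd", "abd"))

# A repeated alternative chooses independently on each repetition.
candidates = ("aa", "ab", "ba", "bb", "a", "abc")
pairs = [s for s in candidates if accepts(r"(a|b){2}", s)]
print(pairs)

# An empty alternative makes the prefix optional.
print(accepts(r"(aa|bb|)c", "c"))
```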

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by their opening parentheses, with 0 standing for the whole regex. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd", which is the 1st, "c", which is the 2nd, and "d", which is the 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
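The backreference examples can be checked directly. This is a sketch under the assumption that Python's re module behaves like the Perl-compatible dialect for numbered backreferences (it does for the "\1" form used here); the variable names are invented for the illustration.

```python
import re

# The regex from the text: "\1" must repeat whatever group 1 captured.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

ok = DECLARATION.fullmatch(" int var2 ; var2=123 ; ")
mismatch = DECLARATION.fullmatch("int a; b = 1;")

print(ok.group(1))       # the captured variable name
print(mismatch is None)  # different names, so the backreference fails

# "([ab])\1" accepts "aa" and "bb" but neither "ab" nor "ba".
doubled = [s for s in ("aa", "bb", "ab", "ba") if re.fullmatch(r"([ab])\1", s)]
print(doubled)
```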

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
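Here is a small sketch of why names survive editing while numbers do not. It assumes Python's re module, which accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above; the patterns and group names are made up for the illustration.

```python
import re

# A quoted string: the closing quote must be the same as the opening one.
quoted = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")

m = quoted.fullmatch("'it works'")
print(m.group("quote"), m.group("body"))

# A mismatched closing quote is rejected by the named backreference.
print(quoted.fullmatch("'oops\"") is None)

# Adding a group in front shifts the numbers but not the names.
shifted = re.compile(r"(\s*)(?P<quote>['\"])(?P<body>.*?)(?P=quote)")
m2 = shifted.fullmatch("  \"it works\"")
print(quoted.groupindex["body"], shifted.groupindex["body"], m2.group("body"))
```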

====Duplicate subpattern numbers and names====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

===Complex assertions===
Complex assertions test some part of the string: the text they test is not included in the match, but it determines whether a match can occur at that position:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
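The four assertions can be explored with a short script. This is a sketch assuming Python's re module as a stand-in for the Perl-compatible dialect; the sample text is invented. Note the last line: a negative lookahead can still succeed after the quantifier backtracks, which is easy to overlook.

```python
import re

text = "aab aac baa caa"

# Positive lookahead: a run of "a" directly followed by "b"; "b" stays outside the match.
followed = [(m.start(), m.group()) for m in re.finditer(r"a+(?=b)", text)]
# Positive lookbehind: a run of "a" directly preceded by "b".
preceded = [(m.start(), m.group()) for m in re.finditer(r"(?<=b)a+", text)]
print(followed, preceded)

# Negative lookbehind: runs of "a" that do not have "b" right before them.
not_preceded = [m.start() for m in re.finditer(r"(?<!b)a+", text)]
print(not_preceded)

# Negative lookahead: "aa" is followed by "b", so the engine gives back one "a".
shortened = re.search(r"a+(?!b)", "aab").group()
print(shortened)
```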

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
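To get a feeling for what parser-level error reporting looks like, here is a sketch that feeds some of the malformed patterns above to Python's re module. This is not Preg's parser: it is only an assumed stand-in that happens to report a message and a position for the same kinds of mistakes, and its messages differ from the ones Preg shows. The helper name describe is made up.

```python
import re

# Malformed patterns taken (or slightly adapted) from the list above.
BROKEN = [
    r"abc)",        # unopened closing parenthesis
    r"(?:qwerty",   # unclosed opening parenthesis
    r"+",           # quantifier without an operand
    r"[a-fA-F\d",   # unclosed character class
    r"[z-a]",       # incorrect character set range
    r"x{5,3}",      # incorrect quantifier range
    "ab\\",         # backslash at the end of the pattern
    r"(abc)\2",     # reference to a nonexistent subpattern
]

def describe(pattern):
    """Return None for a valid pattern, otherwise the parser's (message, position)."""
    try:
        re.compile(pattern)
    except re.error as err:
        return (err.msg, err.pos)
    return None

for pattern in BROKEN:
    print(repr(pattern), "->", describe(pattern))
```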

==The ways to give back==
This project is free software, so it's hard to get any feedback. You shouldn't expect to get software that ideally suits your needs without telling anyone about those needs, or without giving the authors some encouragement or simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

This software is considered a scientific project and such things could be really useful and appreciated:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and were used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you use Preg and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or a journal article;
* cooperating in writing article or help publishing it in English-language journals (information and help in grants for further work is welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-10-05T22:41:08Z

Oasychev: /* Testing tool */ - added information about selection and icons

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you through these docs; read only the parts that fit your needs. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting, error reporting, regular expression testing (authoring tool) - Oleg Sychev.
# Regex parsing, NFA regex matching engine, matchers testing, backup&restore, unicode support, selection in regex text (in authoring tools) - Valeriy Streltsov.
# DFA regex matching engine (deprecated for now) - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Syntax tree (authoring tool) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time.
Thanks to:
* Joseph Rezeau for being devoted tester of Preg question type releases and being the original author of many ideas that have been implemented in Preg question type;
* Tim Hunt - for his polite and useful answers and commentaries that helped writing this question, also for joint work on extra_question_fields and extra_answer_fields code, that is useful to many question type developers;
* Bondarenko Vitaly - for conversion of a vast set of regular expression matching tests.
You, too, could always [[#The ways to give back|help us]] a lot - regardless of the way you use Preg and your capabilities.

==Ways to use Preg questions and this docs==

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see if you really understand regexes well and they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible students answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
The question-level '''case sensitivity''' setting applies to all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
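The difference between the two values of '''Exact matching''' can be sketched as whole-string matching versus searching. The code below is an illustration only, assuming Python's re module; the function is_correct is hypothetical and is not how the question type is implemented.

```python
import re

def is_correct(answer_regex, response, exact_matching):
    """Sketch of the 'Exact matching' setting (an illustration, not Preg's code)."""
    if exact_matching:
        # Yes: the entire response must match the regex.
        return re.fullmatch(answer_regex, response) is not None
    # No: it is enough that some part of the response matches.
    return re.search(answer_regex, response) is not None

print(is_correct(r"whole", "the whole string", exact_matching=False))
print(is_correct(r"whole", "the whole string", exact_matching=True))

# With exact matching off, anchors still force one regex to cover the whole response.
print(is_correct(r"^whole$", "the whole string", exact_matching=False))
print(is_correct(r"^whole$", "whole", exact_matching=False))
```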

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions; precisely, it is the Perl-compatible regex dialect. You may split a regex across multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s instead), so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to the hinting features. You can skip everything said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the settings that configure your question's hinting behaviour.
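The extended notation is easiest to appreciate side by side with the compact one. The sketch below uses Python's re.VERBOSE flag, which is assumed here to be a close analogue of the PHP 'x' modifier (whitespace outside character classes is ignored, "#" starts a comment); the decimal-number regex is just an example.

```python
import re

# The same regex written compactly and in the extended, commented form.
compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

extended = re.compile(
    r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
    """,
    re.VERBOSE,
)

samples = ["123.34", "-.5", "12", "1 . 5"]
compact_results = [bool(compact.fullmatch(s)) for s in samples]
extended_results = [bool(extended.fullmatch(s)) for s in samples]
print(compact_results)
print(extended_results)
```

The spaces used for formatting the pattern are ignored by the engine, so they do not make spaces acceptable in the response: the last sample is rejected by both forms.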

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or when you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features and is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add partial matching support) there is '''no hinting'''. It does support subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it supports many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which is quite a lot), but it may still contain bugs in rare cases. Features unsupported for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has not been maintained for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts): it allows almost anything the DFA engine does, and it is much better tested and more stable.

===Hinting===
Hinting is supported by NFA and DFA engines in adaptive and interactive behaviours.

====Partial matching====
Hinting starts with '''partial matching'''. A partially correct response is a string that starts with correct characters (matching your regex) but where the match breaks at some character. Assume you entered the regex
"'''are blue, white(,| and) red'''"
and a student answered
"they are blue, vhite and red"
In this situation the partial match is
"are blue, "
Note that the regex is unanchored ("Exact match" is set to "No") so the match may not start with the first character of the student's response (like in the example above: "they " is skipped). While using just partial matching the student will see the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type), showing it separately instead for a number of reasons:
# It is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more).
# It slightly encourages thinking about a hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is not usually desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: "," or " " (leading to the " and" path). The question will choose "," because it leads to the shortest path to complete the match, while " " leads to the path 3 characters longer.
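The idea of partial matching and shortest-path hinting can be sketched on this tiny example. The code below is a deliberately naive illustration, not the NFA engine: it assumes the regex accepts only a handful of strings and simply writes them out by hand, whereas the real engine works on the automaton and never enumerates answers. All function names are hypothetical.

```python
# The two strings accepted by "are blue, white(,| and) red", written out by hand.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def partial_match(response):
    """Return (start, length) of the longest partial match inside the response."""
    best = (0, 0)
    for start in range(len(response)):
        for target in ACCEPTED:
            length = common_prefix_len(response[start:], target)
            if length > best[1]:
                best = (start, length)
    return best

def next_char_hint(response):
    """Pick the next character on the shortest way to a complete match."""
    start, length = partial_match(response)
    matched = response[start:start + length]
    candidates = [t for t in ACCEPTED if t.startswith(matched) and len(t) > length]
    if not candidates:
        return None  # nothing left to hint: the match is already complete
    shortest = min(candidates, key=len)
    return shortest[length]

start, length = partial_match("they are blue, vhite and red")
print(repr("they are blue, vhite and red"[start:start + length]))
print(repr(next_char_hint("are blue, white; red")))
```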

It is possible that not all regular expressions will give 100% grade. Consider you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot the other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% are used for hinting and those from 0% to 49% are not. Regular expressions not used for hinting work only when they fully match the student's response.
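In code terms the hint grade border is just a filter applied to the answers before hinting. A minimal sketch, with a hypothetical data layout (a list of regex and grade pairs) that is not taken from the plugin's code:

```python
# Hypothetical answers: (regex, grade as a fraction of the full mark).
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def hinting_answers(answers, hint_grade_border):
    """Keep only the answers that may be used for partial matching and hinting."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(hinting_answers(ANSWERS, 1.0))  # only the 100% answer is used for hints
print(hinting_answers(ANSWERS, 0.5))  # the 60% answer qualifies as well
```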

====Next character hinting====
When next character hinting is available, the student gets a '''hint next character''' button; pressing it reveals the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it can be a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has completed the current lexem). Like
are blue
or
are blue,
or
are blue, white
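The three hints above can be reproduced with a toy scanner. This is only a sketch: the real lexeme borders come from the formal languages block, while the code below assumes a much simpler rule (a lexeme is a run of word characters or a run of punctuation) and a hypothetical helper name.

```python
import re

# Simplified scanner: a lexeme is a run of word characters or a run of punctuation.
LEXEME = re.compile(r"\w+|[^\w\s]+")

def next_lexeme_hint(target, matched_length):
    """Extend the matched prefix of target to the end of the current or next lexeme."""
    for token in LEXEME.finditer(target):
        if token.end() > matched_length:
            return target[:token.end()]
    return target  # everything is matched already

target = "are blue, white, red"
print(next_lexeme_hint(target, len("are bl")))     # match ends inside a lexeme
print(next_lexeme_hint(target, len("are blue")))   # current lexeme is complete
print(next_lexeme_hint(target, len("are blue,")))
```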

Preg question type, since the 2.3 release, supports next lexem hinting through the ''formal languages block''. You should choose the language in which you expect the response to your question, since lexem borders differ between languages. For now it supports these languages (there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : a programming language C (or C++);
; ''printf language'' : a special language for formatting strings in C/C++ programming language, you will have it disabled probably.

Administrator of the site can control what languages are available to the teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't the word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default one.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last of its repetitions is captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.
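Two of these rules are worth seeing in action: a repeated subpattern keeps only its last repetition, and "(?: )" groups without taking a number. The sketch assumes Python's re module, which follows the Perl-compatible behaviour on both points.

```python
import re

# A repeated subpattern keeps only the text from its last repetition.
m = re.fullmatch(r"(ab|cd)+", "abcdab")
print(m.group(0), m.group(1))

# "(?: )" groups without capturing, so it does not take a subpattern number.
capturing = re.compile(r"(ab|cd)+(ef)")
grouping_only = re.compile(r"(?:ab|cd)+(ef)")
print(capturing.groups, grouping_only.groups)
print(grouping_only.fullmatch("abcdef").group(1))
```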

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
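The placeholder substitution described here can be sketched in a few lines. The function fill_feedback is hypothetical and only illustrates the idea with Python's re module; it is not the plugin's implementation.

```python
import re

def fill_feedback(feedback, regex, response):
    """Replace {$n} placeholders with captured subpatterns (illustrative helper)."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return placeholder.group(0)   # leave unknown placeholders untouched
        return match.group(number) or ""  # unmatched subpattern -> empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(FEEDBACK, NUMBER, "123.34"))
print(fill_feedback(FEEDBACK, NUMBER, ".5"))
```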

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too: it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regex for the text that should be missing with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
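The role of the "^" anchor in the missing-word regex is easy to verify. The sketch assumes Python's re module, which treats these lookahead assertions like the Perl-compatible dialect does; the sample sentences are invented.

```python
import re

MISSING = re.compile(r"^(?!.*\bnecessary\b.*)")
UNANCHORED = re.compile(r"(?!.*\bnecessary\b.*)")

with_word = "this is necessary here"
without_word = "this is optional here"

# Anchored: matches only when the word is absent from the whole response.
print(MISSING.search(with_word) is None)
print(MISSING.search(without_word) is not None)

# Unanchored: a match is found at a position after the start of the word,
# even though the word is present.
print(UNANCHORED.search(with_word) is not None)
```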

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because Perl-compatible lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc.), place the simple assertions for the start or end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
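A quick way to check such misplaced-word patterns is to run them against the correct sentence and a few shuffled ones. The sketch below assumes Python's re module and writes the negative lookbehind in the "(?<!...)" form with a single "\s", because both Python and the Perl-compatible dialect require lookbehind assertions of fixed length; the test sentences are invented.

```python
import re

# "am" with no "I " right before it and no "victor" after it.
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is not at the start of the response and is not followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def is_misplaced(pattern, response):
    """True when the pattern finds a misplaced word somewhere in the response."""
    return pattern.search(response) is not None

print(is_misplaced(MISPLACED_AM, "I am victor"))  # correct sentence, nothing found
print(is_misplaced(MISPLACED_AM, "am I victor"))
print(is_misplaced(MISPLACED_AM, "I victor am"))
print(is_misplaced(MISPLACED_I, "I am victor"))   # correct sentence, nothing found
print(is_misplaced(MISPLACED_I, "am I victor"))
```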

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that only give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regex

===Installation note and known technical issues===
To have the ''syntax tree'' and ''explaining graph'' tools working you (or your site admin) have to install Graphviz[http://www.graphviz.org/Graphviz] on the server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. Graphviz is used to draw pictures for you.

Syntax tree and explaining graph may not work correctly in old Opera versions - for some reason the images are not updated on user actions. Fortunately, there's a newer version 16 for Windows which works with authoring tools pretty well. On Linux you will have to use something else.

===Regular expression area===
Here you can edit your regular expression. Clicking on "Update" sends the regex to all tools - so syntax tree, explaining graph, description and testing results are updated. "Save" closes the authoring tools form and saves the regex in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there. The last, "TODO" button, helps you to trace the interrelation between the regex itself (text representation) and the other representations: you can select a regex part and see where it is located in the syntax tree and in the graph.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. It is most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, it remembers which part of the string matched each subexpression. You can insert that part in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings, showing which parts of your strings matched the expression, so you can test whether it works as you expected. You will also see green check marks for the strings that match the entire regular expression (and will be graded for that regex) and red crosses for the strings that don't give a full match. The PHP preg matcher can't show partial matches, so it only shows full matches or nothing (so as not to mislead you into thinking that the entire string is wrong).

If you selected a part of the regex, you will be able to see what part of each string matches that part of the regex (usually in yellow, but that may depend on your theme). The NFA matcher will show that for any part of the regex, the PHP preg matcher only for the capturing subexpressions (subpatterns).

The strings for testing will be saved in the database if you save the regex (they will be lost if you close the window with the "cancel" button) and (later) the question.
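What the testing tool reports can be roughly reproduced outside Moodle. The following is only a sketch, not the Preg implementation: it assumes Python's <code>re</code> module as a stand-in for the matching engine, the function name <code>check_strings</code> is hypothetical, and it does not reproduce the partial matching of the NFA engine. For every test string it tells whether the whole string matches and which part of the string the pattern found.

```python
import re

def check_strings(pattern, strings):
    """Return (string, full_match, matched_part) for every test string."""
    compiled = re.compile(pattern)
    report = []
    for s in strings:
        # fullmatch() succeeds only if the entire string matches (the "green check mark").
        full = compiled.fullmatch(s) is not None
        # search() finds the first place where the pattern matches at all.
        found = compiled.search(s)
        part = found.group(0) if found else None
        report.append((s, full, part))
    return report

report = check_strings(r"colou?r", ["color", "colour", "my colour", "paint"])
for s, full, part in report:
    mark = "OK  " if full else "FAIL"
    print(mark, repr(s), "matched part:", part)
```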

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have been mastering arithmetic expressions since childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we can find that there is a definite '''order of evaluation''', governed by the operators' '''precedence'''. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), minding their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.
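The tree structure behind precedence and arity can be made visible with a few lines of code. This is just an illustrative sketch that assumes Python and its standard <code>ast</code> module (Python's arithmetic follows the same precedence rules as the math above); the helper names <code>describe</code> and <code>tree</code> are hypothetical. The operator that is evaluated last ends up at the root.

```python
import ast

def describe(node):
    """Render an arithmetic expression tree as nested (operator, operands...) tuples."""
    if isinstance(node, ast.BinOp):       # binary operator: two operands
        return (type(node.op).__name__, describe(node.left), describe(node.right))
    if isinstance(node, ast.UnaryOp):     # unary operator: one operand
        return (type(node.op).__name__, describe(node.operand))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError("unsupported expression")

def tree(expression):
    return describe(ast.parse(expression, mode="eval").body)

print(tree("x+y*2"))    # ('Add', 'x', ('Mult', 'y', 2))
print(tree("(x+y)*2"))  # ('Mult', ('Add', 'x', 'y'), 2)
print(tree("-x"))       # ('USub', 'x')
```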

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all - just write some characters in sequence, and they'll be concatenated. But it still has precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. The alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), asterisk (zero or more times) and plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what that means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "many time" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that space is in alternative to not require it before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more times in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
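The examples above can be checked mechanically. The snippet below is a sketch that assumes Python's <code>re</code> module, whose precedence rules for these operators agree with the Perl-compatible syntax described here; the helper name <code>matches</code> is hypothetical and requires the whole text to match.

```python
import re

def matches(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: '*' repeats only the "s".
assert matches(r"many times*", "many timesss")
assert not matches(r"many times*", "many timesmany times")
assert matches(r"(many times)*", "many timesmany times")

# Concatenation over alternative, with an empty alternative.
assert matches(r"(first |second |)part", "second part")
assert matches(r"(first |second |)part", "part")

# Quantifier over alternative: '*' repeats only the "d".
assert matches(r"first|second*", "seconddd")
assert not matches(r"first|second*", "firstfirst")
assert matches(r"(first|second)*", "firstsecondfirstfirst")

# Repeating an alternative set versus repeating each alternative.
assert matches(r"(1|2){2}", "12")
assert not matches(r"1{2}|2{2}", "12")
assert matches(r"1{2}|2{2}", "22")
print("all precedence checks passed")
```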

The internal structure of a regular expression can be seen well in the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower in the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear enough to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^start|end$" will match "start" at the start of the string or "end" at the end of it;
* "^(start|end)$" uses parentheses to match exactly "start" or "end";
* "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).
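The three anchoring variants above behave as follows. This is a sketch assuming Python's <code>re</code> module, where '''^''' and '''$''' have the same meaning for single-line strings; the helper name <code>found</code> is hypothetical and looks for a match anywhere in the text, like a non-anchored regex.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# "^start|end$" is read as "(^start)|(end$)".
assert found(r"^start|end$", "start of the lesson")   # begins with "start"
assert found(r"^start|end$", "the very end")          # ends with "end"
assert not found(r"^start|end$", "a startling ending")

# With parentheses both anchors apply to both alternatives...
assert found(r"^(start|end)$", "end")
assert not found(r"^(start|end)$", "start of the lesson")

# ...and the same holds when every top-level alternative is anchored.
assert found(r"^start$|^end$", "start")
assert not found(r"^start$|^end$", "the very end")
print("all anchoring checks passed")
```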

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and to use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
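Several of the operands listed above can be tried out directly. The sketch below assumes Python's <code>re</code> module, which shares this part of the syntax with Perl-compatible regexes; note that Python has no "\Q ... \E" sequence, so <code>re.escape</code> is used here only as a rough counterpart, and Python supports neither POSIX classes nor "\p{xx}" properties, so those are left out. The helper name <code>full</code> is hypothetical.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Escaped special characters match themselves.
assert full(r"a\*b\[", "a*b[")
assert full(r"\\", "\\")

# re.escape escapes every special character, similar in spirit to \Q...\E.
literal = re.escape("^(abc)$")
assert full(literal + ".", "^(abc)$!")

# Character classes: enumeration, ranges, negation, escaping inside a class.
assert full(r"[ab,!]", ",")
assert full(r"[a-szC-F0-9]", "z") and not full(r"[a-szC-F0-9]", "t")
assert full(r"[^a-z-]", "A") and not full(r"[^a-z-]", "-")
assert full(r"[\-\]\\]", "]")

# Simple assertions consume no characters: \b only checks the position.
assert re.findall(r"\bcat\b", "cat concat cat.") == ["cat", "cat"]
print("all operand checks passed")
```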

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of something used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
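A quick way to see what a quantifier accepts is to try it on strings of increasing length. The following sketch assumes Python's <code>re</code> module; the helper name <code>accepted_counts</code> is hypothetical. The "x{,2}" form is deliberately left out, because its treatment differs between regex dialects.

```python
import re

def accepted_counts(pattern, max_len=6):
    """Return the counts n (0..max_len) for which "x" repeated n times matches fully."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "x" * n)]

print(accepted_counts(r"x*"))      # [0, 1, 2, 3, 4, 5, 6]
print(accepted_counts(r"x+"))      # [1, 2, 3, 4, 5, 6]
print(accepted_counts(r"x?"))      # [0, 1]
print(accepted_counts(r"x{2,4}"))  # [2, 3, 4]
print(accepted_counts(r"x{2,}"))   # [2, 3, 4, 5, 6]
print(accepted_counts(r"x{2}"))    # [2]

# Alternatives, including the empty alternative, and a repeated alternative.
assert re.fullmatch(r"(aa|bb|)c", "c")
assert re.fullmatch(r"(a|b){2}", "ab")
assert re.fullmatch(r"(ab)*", "ababab")
```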

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered starting from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is the 1st, "c" which is the 2nd and "d" which is the 3rd.
Subpatterns are usually used with '''backreferences''', which have numbers too. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
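The declaration regex above can be run as it is. This sketch assumes Python's <code>re</code> module, which uses the same numbered subpattern and "\1" backreference syntax; the names <code>DECLARATION</code> and <code>declared_name</code> are hypothetical.

```python
import re

# The regex from the example above, unchanged.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

def declared_name(text):
    """Return the variable name if text declares and then initialises the same variable."""
    m = DECLARATION.fullmatch(text)
    return m.group(1) if m else None

print(declared_name(" int var2 ; var2=123 ; "))   # var2
print(declared_name("int _var; _var = -10;"))     # _var
print(declared_name("int a; b = 1;"))             # None: \1 must repeat "a"

# The simplest backreference example from the text.
assert re.fullmatch(r"([ab])\1", "aa")
assert not re.fullmatch(r"([ab])\1", "ab")
```

The third call returns nothing because the backreference has to match exactly the string captured by the first subpattern, not just anything the subpattern could match.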

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
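As a small illustration of named subpatterns, here is a sketch assuming Python's <code>re</code> module. Python accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the other spellings are not shown; the variable name <code>doubled</code> is hypothetical.

```python
import re

# A doubled word: the named subpattern "word" must be repeated verbatim.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = doubled.search("this is is a typo")
print(m.group("word"))   # is
print(m.group(1))        # is  (a named subpattern still has its number)
print(doubled.search("this is a test"))   # None
```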

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

===Complex assertions===
Assertions about some part of the string don't actually go into the matching text, but affect whether a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
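The four assertions can be compared on short strings. This is a sketch assuming Python's <code>re</code> module, which uses the same lookahead and lookbehind syntax; the helper name <code>first_match</code> is hypothetical. Note how the negative assertions may make the match shorter instead of making it fail.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Positive lookahead: "b" must follow but is not part of the match.
assert first_match(r"a+(?=b)", "aaab") == "aaa"
assert first_match(r"a+(?=b)", "aaac") is None

# Negative lookahead: the engine gives back one "a" so that no "b" follows.
assert first_match(r"a+(?!b)", "aaac") == "aaa"
assert first_match(r"a+(?!b)", "aaab") == "aa"

# Positive lookbehind: "b" must precede but is not part of the match.
assert first_match(r"(?<=b)a+", "baaa") == "aaa"
assert first_match(r"(?<=b)a+", "caaa") is None

# Negative lookbehind: the match starts at the first "a" not preceded by "b".
assert first_match(r"(?<!b)a+", "caaa") == "aaa"
assert first_match(r"(?<!b)a+", "baaa") == "aa"
print("all assertion checks passed")
```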

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
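To get a feeling for this kind of error reporting, you can feed some of the broken patterns from the list to another regex parser. The sketch below assumes Python's <code>re</code> module, not the Preg parser: its messages and the set of detected errors differ from the ones Preg reports, and the helper name <code>regex_error</code> is hypothetical.

```python
import re

def regex_error(pattern):
    """Return the parser's error message for a pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

broken = [
    "abc)",          # unopened closing parenthesis
    "(?:qwerty",     # unclosed opening parenthesis
    "+",             # quantifier without an operand
    "[a-fA-F\\d",    # unclosed character class
    "[z-a]",         # incorrect character set range
    "x{5,3}",        # incorrect quantifier range
    "(abc)\\2",      # reference to a nonexistent subpattern
    "ab\\",          # backslash at end of pattern
]
for pattern in broken:
    print(repr(pattern), "->", regex_error(pattern))

assert regex_error("abc") is None
```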

==The ways to give back==
This project is free software, so it's hard to get any feedback. You shouldn't expect to get software which ideally suits your needs without telling anyone about these needs, or giving encouragement, or some non-difficult support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

This software is considered a scientific project and such things could be really useful and appreciated:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and were used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you use Preg and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or a journal article;
* cooperation in writing an article or help with publishing it in English-language journals (information and help with grants for further work is welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, when doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has much to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort doing this though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-10-05T22:22:35Z

Oasychev: Changed developers information

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it at your discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting, error reporting, regular expression testing (authoring tool) - Oleg Sychev.
# Regex parsing, NFA regex matching engine, matchers testing, backup&restore, unicode support, selection in regex text (in authoring tools) - Valeriy Streltsov.
# DFA regex matching engine (deprecated for now) - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Syntax tree (authoring tool) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time.
Thanks to:
* Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type;
* Tim Hunt - for his polite and useful answers and commentaries that helped writing this question, also for joint work on extra_question_fields and extra_answer_fields code, that is useful to many question type developers;
* Bondarenko Vitaly - for conversion of a vast set of regular expression matching tests.
You, too, could always [[#The ways to give back|help us]] a lot - regardless of the way you use Preg and your capabilities.

==Ways to use Preg questions and this docs==

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of plain trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of the authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
'''Case sensitivity''' sets the case sensitivity for all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
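The difference between the two modes can be sketched in Python. This is only an analogy using Python's <code>re</code> module, not the Preg engines themselves, and the helper name <code>is_correct</code> is made up for illustration:

```python
import re

def is_correct(pattern, response, exact):
    """Hypothetical helper: check a response under the chosen matching mode."""
    if exact:
        # Exact matching "Yes": the entire response must match the pattern.
        return re.fullmatch(pattern, response) is not None
    # Exact matching "No": it is enough that some part of the response matches.
    return re.search(pattern, response) is not None

print(is_correct("whole", "the whole string", exact=False))  # True
print(is_correct("whole", "the whole string", exact=True))   # False
print(is_correct("whole", "whole", exact=True))              # True
```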

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may split a regex over multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s instead), so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to the hinting features. You can skip everything that is said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
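To get a feeling for the extended notation, here is a minimal sketch that uses Python's <code>re.VERBOSE</code> flag, which behaves similarly (unescaped whitespace is ignored and '#' starts a comment). It is an analogy, not the Preg implementation; the decimal-number regex is the one used in the subpattern section of this page.

```python
import re

# The same regex written compactly and in a verbose, commented layout.
compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

extended = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
""", re.VERBOSE)

for text in ["123.34", "-.5", "12"]:
    # Both notations should accept and reject exactly the same strings.
    print(text, bool(compact.fullmatch(text)), bool(extended.fullmatch(text)))
```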

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Nondeterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which is a substantial amount), but it may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has not been maintained for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that the DFA engine accepts): it allows almost anything the DFA engine does, but is much better tested and more stable.

===Hinting===
Hinting is supported by NFA and DFA engines in adaptive and interactive behaviours.

====Partial matching====
Hinting starts with '''partial matching'''. By a partially correct response we mean a string that starts with correct characters (matching your regex) but where the match breaks at some character. Assume you entered the regex
"'''are blue, white(,| and) red'''"
and a student answered
"they are blue, vhite and red"
In this situation the partial match is
"are blue, "
Note that the regex is unanchored ("Exact match" is set to "No") so the match may not start with the first character of the student's response (like in the example above: "they " is skipped). While using just partial matching the student will see the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type), showing them separately instead, for a number of reasons:
# It is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response.
# It slightly encourages thinking about a hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: "," or " " (leading to the " and" path). The question will choose "," because it leads to the shortest path to complete the match, while " " leads to the path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot the other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
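The filtering rule can be sketched as follows. This is a simplified illustration, not the plugin's code; the <code>Answer</code> class and the <code>answers_for_hinting</code> function are hypothetical names invented for this example.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    # Hypothetical container for one answer of a question.
    regex: str
    grade: float  # fraction from 0.0 to 1.0

def answers_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade is at or above the hint grade border."""
    return [a for a in answers if a.grade >= hint_grade_border]

answers = [
    Answer(r"are blue, white(,| and) red", 1.0),
    Answer(r"are white(,| and) red", 0.6),
]

print([a.grade for a in answers_for_hinting(answers, 1.0)])  # [1.0]
print([a.grade for a in answers_for_hinting(answers, 0.5)])  # [1.0, 0.6]
```

With the border set to 1, the 60% "bad memory" answer is never used to generate hints, but it can still give its feedback on a full match.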

====Next character hinting====
When next character hinting is available, the student will see a '''hint next character''' button. Pressing it shows the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' more than usual question '''penalty''', because they are applied separately: usual penalty for an attempt without hinting, while hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For a natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexems. For a programming language it can be a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has completed the current lexem). Like
are blue
or
are blue,
or
are blue, white
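
To make the idea of lexem borders concrete, here is a very rough sketch of a scanner for simple English text. It is an assumption-laden toy written for this page (the pattern and the <code>lexems</code> function are hypothetical), not the scanner used by the formal languages block:

```python
import re

# Hypothetical toy scanner: a lexem is either a run of word characters
# (a word or a number) or a run of punctuation marks; spaces only separate lexems.
LEXEM = re.compile(r"\w+|[^\w\s]+")

def lexems(text):
    """Split text into lexems, dropping the whitespace between them."""
    return LEXEM.findall(text)

print(lexems("are blue, white and red"))  # ['are', 'blue', ',', 'white', 'and', 'red']
print(lexems("Really?! Yes..."))          # ['Really', '?!', 'Yes', '...']
```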

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't the word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching, the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. More precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
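The placeholder substitution described above can be imitated in a few lines of Python. This is only a sketch of the idea under the assumption that placeholders look like <code>{$n}</code>; the function <code>fill_feedback</code> is a hypothetical name and not part of the question type.

```python
import re

def fill_feedback(feedback, match):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            # Unknown subpattern number: leave the placeholder untouched.
            return placeholder.group(0)
        # A subpattern that did not take part in the match yields an empty string.
        return match.group(number) or ""
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

number = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(feedback, number.fullmatch("123.34")))
print(fill_feedback(feedback, number.fullmatch(".5")))
```

The second call shows the empty integral part: the words "Integral part is" stay in place, followed by nothing.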

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp for the thing that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
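A quick way to convince yourself that the anchored negative lookahead works is to try it outside Moodle. The sketch below uses Python's <code>re</code> module, which understands the same syntax for this pattern; the function name <code>is_missing</code> is invented for the example.

```python
import re

# Matches (with an empty match at the start) only if the word is absent.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def is_missing(response):
    """Return True when the word "necessary" does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None

print(is_missing("it is important"))    # True
print(is_missing("it is necessary"))    # False
print(is_missing("it is unnecessary"))  # True: \b requires a whole word
```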

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part, a negative lookbehind) and doesn't have "victor" after it (the "(?!\s+victor)" part, a negative lookahead). "\s+" allows any number of spaces between words in the lookahead; the lookbehind uses a single "\s" because Perl-compatible lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
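
The following sketch tries the misplaced-word idea in Python's <code>re</code> module. Two assumptions are worth stating: the negative lookbehind is written in its standard form <code>(?<!...)</code>, and it uses a single <code>\s</code> because Python (like PCRE) only accepts fixed-width lookbehind. It illustrates the technique and says nothing about which Preg engine accepts these patterns.

```python
import re

# "am" that is neither preceded by "I " nor followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by " am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

for response in ["I am victor", "am I victor", "I victor am"]:
    print(response,
          bool(MISPLACED_AM.search(response)),
          bool(MISPLACED_I.search(response)))
```

Only the correct sentence "I am victor" triggers neither pattern, which also shows why the search is rough: a word is reported only when both of its neighbours are wrong.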

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting|hinting]]). You actually don't want to generate hints for these answers, as they don't define a correct situation, so it's not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regex

===Installation note and known technical issues===
To have the ''syntax tree'' and ''explaining graph'' tools working, you (or your site admin) have to install Graphviz[http://www.graphviz.org/Graphviz] on the server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. Graphviz is used to draw the pictures for you.

Syntax tree and explaining graph may not work correctly in old Opera versions - for some reason the images are not updated on user actions. Fortunately, there's a newer version 16 for Windows which works with authoring tools pretty well. On Linux you will have to use something else.

===Regular expression area===
Here you can edit your regular expression. Clicking on "Update" sends the regex to all tools - so syntax tree, explaining graph, description and testing results are updated. "Save" closes the authoring tools form and saves the regex in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there. The last, "TODO" button, helps you to trace the interrelation between the regex itself (text representation) and the other representations: you can select a regex part and see where it is located in the syntax tree and in the graph.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of an expression graphically: what is inside what. It will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all been mastering arithmetic expressions since childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words, and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each (arity), controlling their evaluation order by means of their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. Their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches the single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character of its own: just write some characters in sequence and they'll be concatenated. But it still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. An alternative is written as a vertical bar. There are many forms of quantifiers: the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may also specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "many time" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that space is in alternative to not require it before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more time in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
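The precedence examples above can be checked with a short Python script. Python's <code>re</code> module follows the same precedence rules for these operators, so it serves here as a stand-in for the question's engines; the helper <code>matches</code> is just a convenience defined for this example.

```python
import re

def matches(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifier binds tighter than concatenation: only "s" is repeated.
print(matches(r"many times*", "many timesss"))              # True
print(matches(r"many times*", "many timesmany times"))      # False
print(matches(r"(many times)*", "many timesmany times"))    # True

# Quantifier binds tighter than alternative: only "d" is repeated.
print(matches(r"first|second*", "seconddd"))                # True
print(matches(r"(first|second)*", "firstsecondfirstfirst")) # True

# A repeated alternative may pick a different branch on each repetition.
print(matches(r"(1|2){2}", "12"))                           # True
print(matches(r"1{2}|2{2}", "12"))                          # False
```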

The internal structure of a regular expression can be viewed well in the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower in the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the tree and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repetition.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^start|end$" will match "start" at the start of the string or "end" at the end of it;
* "^(start|end)$" uses parentheses to match exactly "start" or "end";
* "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).
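
The three anchoring variants behave as follows when tried in Python's <code>re</code> module (used here only as a convenient stand-in; the test strings are made up for the example):

```python
import re

patterns = [r"^start|end$", r"^(start|end)$", r"^start$|^end$"]
responses = ["start", "end", "start here", "the end"]

for pattern in patterns:
    # re.search looks for a match anywhere, so only the anchors restrict it.
    results = [bool(re.search(pattern, response)) for response in responses]
    print(pattern, results)
```

The first pattern accepts all four responses, because each alternative carries only one of the anchors; the other two accept only "start" and "end".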

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get a special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except those from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters, so it stands for itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there.
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as Unicode properties (and a complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
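Several of the operands listed above can be tried out in Python, whose <code>re</code> module treats them the same way (it does not support "\Q...\E", Unicode properties or POSIX classes, so those are left out). The helper <code>found</code> is defined only for this sketch.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# Character classes with ranges and single characters.
print(found(r"[a-szC-F0-9]", "z"))       # True
print(found(r"[a-szC-F0-9]", "t"))       # False: "t" lies between "s" and "z"

# Negative character set: anything except "a".."z" and "-".
print(found(r"[^a-z-]", "a-b"))          # False
print(found(r"[^a-z-]", "a+b"))          # True

# Escaping inside a character set and escaped special characters outside it.
print(found(r"[\-\]\\]", "]"))           # True
print(found(r"a\*b\[", "a*b["))          # True

# Word boundary: a simple assertion that consumes no characters.
print(found(r"\bcat\b", "a cat sat"))    # True
print(found(r"\bcat\b", "concatenate"))  # False
```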

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define the repetition of something used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
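
As a small experiment, the script below lists how many repetitions of "x" each quantifier accepts. It runs in Python's <code>re</code> module, used as a stand-in for the engines described here; the helper <code>accepted_counts</code> is a name invented for the example.

```python
import re

def accepted_counts(pattern, limit=6):
    """Return the repetition counts n < limit for which "x" * n matches fully."""
    return [n for n in range(limit) if re.fullmatch(pattern, "x" * n)]

for pattern in [r"x*", r"x+", r"x?", r"x{2,4}", r"x{2,}", r"x{2}"]:
    print(pattern, accepted_counts(pattern))

# Parentheses let a quantifier repeat more than one character.
print(bool(re.fullmatch(r"(ab)*", "ababab")))  # True
print(bool(re.fullmatch(r"ab*", "ababab")))    # False
```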

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is the 1st, "c" which is the 2nd and "d" which is the 3rd.
Subpatterns are usually used with '''backreferences''' which have numbers too. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
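
Numbered backreferences use the same "\1" syntax in Python's <code>re</code> module, so the examples from this section can be tried there directly (as an illustration only, outside Moodle):

```python
import re

# The backreference must repeat exactly what the subpattern captured.
pair = re.compile(r"([ab])\1")
print([s for s in ["aa", "bb", "ab", "ba"] if pair.fullmatch(s)])  # ['aa', 'bb']

# Declaration and initialization of the same integer variable in C.
declaration = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")
print(bool(declaration.fullmatch(" int var2 ; var2=123 ; ")))  # True
print(bool(declaration.fullmatch("int a; b = 1;")))            # False: names differ
```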

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
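Of the naming syntaxes listed above, Python's <code>re</code> module understands only "(?P<name>...)" and "(?P=name)", so the sketch below is limited to those two. It reuses the variable declaration example; the group name <code>var</code> is chosen just for this illustration.

```python
import re

# A named subpattern and a backreference to it by name.
declaration = re.compile(r"int (?P<var>[_\w][_\w\d]*); (?P=var) = -?\d+;")

match = declaration.fullmatch("int _var; _var = -10;")
print(match.group("var"))  # _var
print(match.group(1))      # _var: a named subpattern still has its number

print(declaration.fullmatch("int a; b = 1;"))  # None
```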

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

===Complex assertions===
Assertions check a condition on some part of the string. The text they examine is not included in the match, but it decides whether a match occurs at that position:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
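
The four assertions can be compared side by side. This is a small sketch using Python's standard <code>re</code> module (an assumption made for illustration; the helper name <code>first</code> is invented), showing that the asserted character never becomes part of the match.

```python
import re

def first(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first(r"a+(?=b)", "aaab"))   # aaa - "b" is required but not included
print(first(r"a+(?!b)", "aaab"))   # aa  - the engine backs off one "a"
print(first(r"(?<=b)a+", "baaa"))  # aaa - must start right after "b"
print(first(r"(?<!b)a+", "baaa"))  # aa  - cannot start right after "b"
```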

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed inside parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity turned on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
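
The effect of the two examples can be approximated in Python's standard <code>re</code> module. This is only a sketch under an assumption: Python does not scope a mid-pattern "(?i)" the way Perl-compatible engines do, so the scoped form "(?i:...)" is used here to mark the same case-insensitive parts.

```python
import re

# Counterpart of "abc(de(?i)gh)xyz": only "gh" ignores case.
inside_group = re.compile(r"abc(de(?i:gh))xyz")
# Counterpart of "abc(de)(?i)ghxyz": everything from "gh" to the end ignores case.
to_the_end = re.compile(r"abc(de)(?i:ghxyz)")

print(bool(inside_group.fullmatch("abcdeGHxyz")))  # True
print(bool(inside_group.fullmatch("abcdeghXYZ")))  # False: "xyz" is case-sensitive
print(bool(to_the_end.fullmatch("abcdeGHXYZ")))    # True
print(bool(to_the_end.fullmatch("ABCdeghxyz")))    # False: "abc" is case-sensitive
```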

===Error reporting===
Native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a non-existent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
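
To get a feeling for what parser-level error reporting looks like, the sketch below feeds a few of the faulty patterns from the list to Python's standard <code>re</code> parser. This is an assumption made purely for illustration: the messages come from Python, not from Preg's own parser, and the helper name <code>explain</code> is invented.

```python
import re

def explain(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as err:
        return str(err)
    return None

samples = [
    "abc)",         # unopened closing parenthesis
    "(?:qwerty",    # unclosed opening parenthesis
    "+",            # quantifier without an operand
    r"[a-fA-F\d",   # unclosed character class
    "[z-a]",        # incorrect character range
    "a{5,3}",       # incorrect quantifier range
    r"(abc)\2",     # reference to a non-existent subpattern
    "ab\\",         # backslash at the end of the pattern
]
for pattern in samples:
    print(repr(pattern), "->", explain(pattern))
```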

==The ways to give back==
This project is free software, so it's hard to get any feedback. You shouldn't expect to get software that ideally suits your needs without telling anyone about those needs, or without giving the authors some encouragement or simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

This software is considered a scientific project, so the following things would be really useful and appreciated:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you use Preg and run a survey of your teachers and/or students about it - but it can also include co-authoring conference theses or a journal article;
* cooperation in writing an article, or help with publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc.).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper that describes your use of the Preg question type, and that I could cite, would improve the standing of the project and my rating as a researcher/developer, so please publish and send me the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-09-07T19:26:32Z

Oasychev:

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Syntax tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.
You, too, could always [[#The ways to give back|help us]] a lot - regardless of the way you use Preg and your capabilities.

==Ways to use Preg questions and this docs==

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot in creating your questions. The tools show you the meaning of your regex in different ways: the internal structure of the expression (syntax tree), the visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible students' answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
The '''case sensitivity''' setting sets the case sensitivity for all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
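
The difference between the two values can be sketched with Python's standard <code>re</code> module, where <code>re.fullmatch</code> plays the role of ''Yes'' and <code>re.search</code> the role of ''No''. This is an illustrative assumption, not Preg's implementation, and the function name <code>is_correct</code> is invented.

```python
import re

def is_correct(regex, response, exact):
    """Check a response the way the 'Exact matching' setting is described."""
    matcher = re.fullmatch if exact else re.search
    return matcher(regex, response) is not None

print(is_correct("whole", "the whole string", exact=False))    # True
print(is_correct("whole", "the whole string", exact=True))     # False
# Anchors make a single regex exact even when the setting is "No".
print(is_correct("^whole$", "the whole string", exact=False))  # False
print(is_correct("^whole$", "whole", exact=False))             # True
```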

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s instead) - so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything after a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to the hinting features. You can skip everything that is said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
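
To show what the extended notation buys you, here is a sketch using Python's <code>re.VERBOSE</code> flag, which is assumed to be a close analogue of the 'x' modifier (it is not Preg's own parser). The same decimal-number regex is written compactly and then with free formatting and comments.

```python
import re

compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

extended = re.compile(r"""
    [+\-]?       # optional sign
    ([0-9]+)?    # optional integral part
    \.           # decimal point (escaped)
    ([0-9]+)     # fractional part
""", re.VERBOSE)

print(compact.fullmatch("-12.5").groups())
print(extended.fullmatch("-12.5").groups())

# Unescaped spaces are ignored, so a real space must be written as \s.
print(bool(re.fullmatch(r"a b", "ab", re.VERBOSE)))     # True
print(bool(re.fullmatch(r"a\sb", "a b", re.VERBOSE)))   # True
```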

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or when you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of the Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that the other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code. It allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but it may still contain bugs in rare cases. The features not supported for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. if it does not reject regexes that this engine accepts). It allows almost anything the DFA engine does, but the NFA engine is much better tested and more stable.

===Hinting===
Hinting is supported by NFA and DFA engines in adaptive and interactive behaviours.

====Partial matching====
Hinting starts with '''partial matching'''. By a partially correct response we mean a string that starts with correct characters (matching your regex) but where the match breaks at some character. Assume you entered the regex
"'''are blue, white(,| and) red'''"
and a student answered
"they are blue, vhite and red"
In this situation the partial match is
"are blue, "
Note that the regex is unanchored ("Exact match" is set to "No") so the match may not start with the first character of the student's response (like in the example above: "they " is skipped). While using just partial matching the student will see the correct and incorrect parts:
they are blue, vhite and red
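
The idea of a partial match can be sketched in a few lines of Python. This is a deliberately simplified model, not the NFA engine: it assumes the two strings accepted by the example regex are listed by hand in <code>ACCEPTED</code>, and it looks for the longest piece of the response that is a beginning of one of them. All names are invented for the example.

```python
# The regex "are blue, white(,| and) red" accepts exactly these two strings.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def common_prefix_length(a, b):
    """Count how many leading characters of a and b coincide."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def partial_match(response):
    """Return (start, matched_text) for the longest partial match."""
    best_start, best_length = 0, 0
    for start in range(len(response)):       # unanchored: try every start
        for target in ACCEPTED:
            length = common_prefix_length(response[start:], target)
            if length > best_length:
                best_start, best_length = start, length
    return best_start, response[best_start:best_start + best_length]

print(partial_match("they are blue, vhite and red"))  # (5, 'are blue, ')
```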

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type), showing it separately instead for a number of reasons:
# It is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more).
# It slightly encourages thinking about a hint: if the response were modified automatically, it would be too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: "," or " " (leading to the " and" path). The question will choose "," because it leads to the shortest path to complete the match, while " " leads to the path 3 characters longer.
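
The "shortest path" rule can be illustrated with a simplified Python sketch. It assumes the accepted strings of the example regex are enumerated by hand and that the already matched beginning of the response is known; the real engine works on an automaton instead. The function name is invented.

```python
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def next_character_hint(matched_prefix):
    """Pick the next character on the shortest way to a full match."""
    candidates = [
        target for target in ACCEPTED
        if target.startswith(matched_prefix) and len(target) > len(matched_prefix)
    ]
    if not candidates:
        return None                      # nothing left to hint
    shortest = min(candidates, key=len)  # fewest characters still to type
    return shortest[len(matched_prefix)]

print(repr(next_character_hint("are blue, white")))    # ','
print(repr(next_character_hint("are blue, white a")))  # 'n'
```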

It is possible that not all regular expressions will give 100% grade. Consider you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot the other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting. If you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
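
The filtering rule is simple enough to write down. In this sketch the answers are assumed to be (regex, grade) pairs with grades as fractions from 0 to 1; the data structure and the function name are invented for illustration.

```python
answers = [
    ("are blue, white(,| and) red", 1.0),  # fully correct answer
    ("are white(,| and) red", 0.6),        # answer for students who forgot "blue"
]

def answers_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(answers_for_hinting(answers, 1))    # only the 100% answer
print(answers_for_hinting(answers, 0.5))  # both answers
```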

====Next character hinting====
When next character hinting is available, the student will see the '''hint next character''' button. Pressing it reveals the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it can be a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for formatting strings in the C/C++ programming language; you will probably have it disabled.

Administrator of the site can control what languages are available to the teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't the word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening parentheses. More precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.
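
The numbering rules can be checked with a short sketch in Python's standard <code>re</code> module (assumed here as a stand-in engine; the regex is invented for the example). It shows counting by opening parentheses, the "last repeat wins" rule and a non-capturing group.

```python
import re

#            1 2        (no number) 3
m = re.fullmatch(r"((a|b)+)(?:c)(d)", "abacd")

print(m.group(0))   # abacd - the whole match
print(m.group(1))   # aba   - the outer subpattern
print(m.group(2))   # a     - only the last repeat of (a|b) is kept
print(m.group(3))   # d     - (?:c) did not take a number
```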

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax might become complex and prone to errors.
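
The placeholder substitution described above can be sketched as follows. This is a hypothetical re-implementation in Python for illustration only; the function <code>fill_feedback</code> and its behaviour for out-of-range numbers are assumptions, not Preg's actual code.

```python
import re

PLACEHOLDER = re.compile(r"\{\$(\d+)\}")

def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    match = re.fullmatch(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return ""                     # assumption: unknown numbers become empty
        return match.group(number) or ""  # an unmatched subpattern becomes empty

    return PLACEHOLDER.sub(substitute, feedback)

regex = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(regex, "123.34", feedback))
print(fill_feedback(regex, ".5", feedback))
```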

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regex for the thing that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
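
A quick way to convince yourself that the pattern works is to try it on a few responses. The sketch below uses Python's standard <code>re</code> module as an assumed stand-in for a Perl-compatible engine; the helper name is invented.

```python
import re

MISSING = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True when the word "necessary" does not occur in the response."""
    return MISSING.search(response) is not None

print(word_is_missing("it is needed"))           # True
print(word_is_missing("it is necessary to go"))  # False
print(word_is_missing("it is unnecessary"))      # True: \b requires a whole word
```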

You can also have a rough search for '''misplaced words''' (it will actually work only if anything else is correct) using syntax like this:
 (?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between the words; the lookbehind uses a single "\s" because a lookbehind assertion cannot contain an unbounded quantifier such as "+". If you want to catch the first (last) word (punctuation mark, etc.), you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
 (?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by the start of the string and not followed by "am".
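
The two "misplaced word" patterns can be exercised with Python's standard <code>re</code> module. This sketch assumes the negative lookbehind is spelled "(?<!...)" and has a fixed width (a single "\s"), which both Python and Perl-compatible engines require; the constant names are invented.

```python
import re

MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

for response in ["I am victor", "am I victor", "I victor am"]:
    print(response,
          "| misplaced am:", bool(MISPLACED_AM.search(response)),
          "| misplaced I:", bool(MISPLACED_I.search(response)))
```

As the text says, this is a rough search: "I victor am" is flagged for "am" but not for "I", because "I" still stands at the start of the string.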

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and other answers that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regex

===Installation note and known technical issues===
To have the ''syntax tree'' and ''explaining graph'' tools working, you (or your site admin) have to install Graphviz[http://www.graphviz.org/Graphviz] on the server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. Graphviz is used to draw the pictures for you.

Syntax tree and explaining graph may not work correctly in old Opera versions - for some reason the images are not updated on user actions. Fortunately, there's a newer version 16 for Windows which works with authoring tools pretty well. On Linux you will have to use something else.

===Regular expression area===
Here you can edit your regular expression. Clicking on "Update" sends the regex to all tools - so syntax tree, explaining graph, description and testing results are updated. "Save" closes the authoring tools form and saves the regex in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there. The last, "TODO" button, helps you to trace the interrelation between the regex itself (text representation) and the other representations: you can select a regex part and see where it is located in the syntax tree and in the graph.

===Syntax tree===
Regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show the paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it into the feedback or use it in a backreference within the expression. If you do not need to remember a subexpression, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tool tries to formulate a sentence describing how your expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you arrange a sequence of '''operators''' with the correct number of '''operands''' for each (arity), controlling their evaluation order through their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches the single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters into a pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all - just write some characters in sequence, and they'll be concatenated. But it still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. An alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.
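
A small sketch of why escaping matters, using Python's standard <code>re</code> module as an assumed example engine. The helper <code>re.escape</code> is a Python convenience that adds the backslashes for you; it is mentioned only as an illustration.

```python
import re

# Unescaped, "+" is a quantifier: "1+1=2" means one or more "1", then "1=2".
print(re.fullmatch(r"1+1=2", "1+1=2"))        # None
print(bool(re.fullmatch(r"1+1=2", "111=2")))  # True

# Escaped, "\+" is an ordinary character (an operand).
print(bool(re.fullmatch(r"1\+1=2", "1+1=2")))  # True

# re.escape() escapes every special character in a literal string.
literal = "1+1=2 (really?)"
print(bool(re.fullmatch(re.escape(literal), literal)))  # True
```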

Character classes allow you to specify several possible characters for one position. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - the start of the string '''^''', the end of the string '''$''' or a word boundary '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "manytime" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that space is in alternative to not require it before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more times in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
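
The precedence rules above can be checked with any Perl-compatible engine. The following is a minimal sketch in Python, whose standard <code>re</code> module follows the same precedence rules for these operators; it is not the Preg engine itself, and the helper name <code>matches_whole</code> is hypothetical.

```python
import re

def matches_whole(regex, text):
    """Return True if the regex matches the entire text."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: "*" repeats only the final "s".
assert matches_whole(r"many times*", "many timesss")
assert not matches_whole(r"many times*", "many timesmany times")
# Parentheses make the quantifier repeat the whole sequence.
assert matches_whole(r"(many times)*", "many timesmany times")

# Quantifier over alternative: "*" repeats only the final "d".
assert matches_whole(r"first|second*", "seconddd")
assert not matches_whole(r"first|second*", "firstsecond")
assert matches_whole(r"(first|second)*", "firstsecondfirstfirst")

# A quantifier repeats the whole alternative, not one fixed choice.
assert matches_whole(r"(1|2){2}", "12")
assert not matches_whole(r"1{2}|2{2}", "12")
```

Every assertion passes silently, so running the script without an error confirms each rule.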

The internal structure of a regular expression can be seen clearly in the [[#Syntax tree|syntax tree]] (authoring tool). Operators that are executed first are placed lower in the tree (or further to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the tree and the explaining graphs for the examples above in the authoring tools if this section doesn't seem clear enough to you. Remember that "executing" a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repetition.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^start|end$" will match "start" from the start of the string or "end" at the end of it;
* "^(start|end)$" uses parentheses to match exactly "start" or "end";
* "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).
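
The anchoring cases above can be tried out as follows. This is only a sketch using Python's <code>re</code> module as a stand-in for the question's matching engine; the helper <code>finds</code> is a hypothetical name introduced for illustration.

```python
import re

def finds(regex, response):
    """Return the matched text, or None when there is no match."""
    m = re.search(regex, response)
    return m.group(0) if m else None

# Unanchored: the match may be anywhere inside the response.
print(finds(r"start|end", "the end is near"))       # end

# "^start|end$" anchors each alternative separately.
print(finds(r"^start|end$", "start of the story"))  # start
print(finds(r"^start|end$", "happy end"))           # end

# Both forms below accept only the exact responses "start" or "end".
print(finds(r"^(start|end)$", "happy end"))         # None
print(finds(r"^start$|^end$", "end"))               # end
```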

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and don't need to be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters which could be used both inside or outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and a complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
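
A few of the operands listed above can be tried out in a short script. This sketch uses Python's <code>re</code> module, which is an assumption made for illustration rather than the engine Preg uses; note that Python's <code>re</code> does not support "\Q ... \E", "\p{xx}" or POSIX classes, so only the operands common to both dialects are shown.

```python
import re

# re.escape puts a backslash before every special character,
# so the result matches the original text literally.
literal = "a*b["
escaped = re.escape(literal)
assert re.fullmatch(escaped, literal)

# Character classes: enumeration, ranges and negation.
assert re.fullmatch(r"[ab,!]", "!")
assert re.fullmatch(r"[a-szC-F0-9]", "z")
assert not re.fullmatch(r"[a-szC-F0-9]", "t")
assert re.fullmatch(r"[^a-z-]", "A")
assert not re.fullmatch(r"[^a-z-]", "-")

# Escape sequences and simple assertions.
assert re.fullmatch(r"\w\s\d", "x 7")
assert re.search(r"\bcat\b", "a cat sat")
assert not re.search(r"\bcat\b", "concatenate")
```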

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
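
To see which repetition counts each quantifier accepts, one can simply enumerate them. The sketch below does this with Python's <code>re</code> module (an assumption for illustration, not Preg's own engine); <code>accepted_counts</code> is a hypothetical helper.

```python
import re

def accepted_counts(regex, char="x", limit=6):
    """List the repetition counts (0..limit) the regex accepts in full."""
    return [n for n in range(limit + 1) if re.fullmatch(regex, char * n)]

print(accepted_counts(r"x*"))      # [0, 1, 2, 3, 4, 5, 6]
print(accepted_counts(r"x+"))      # [1, 2, 3, 4, 5, 6]
print(accepted_counts(r"x?"))      # [0, 1]
print(accepted_counts(r"x{2,4}"))  # [2, 3, 4]
print(accepted_counts(r"x{2,}"))   # [2, 3, 4, 5, 6]
print(accepted_counts(r"x{2}"))    # [2]
```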

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
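
Both regexes from this example can be tried out directly. The sketch below uses Python's <code>re</code> module, which handles numbered backreferences in the same way for these patterns; the variable names <code>simple</code> and <code>flexible</code> are chosen for illustration only.

```python
import re

simple = r"int ([_\w][_\w\d]*); \1 = -?\d+;"
flexible = r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"

m = re.fullmatch(simple, "int _var; _var = -10;")
print(m.group(1))  # _var

# The backreference rejects a different name in the assignment.
print(re.fullmatch(simple, "int a; b = 1;"))  # None

m = re.fullmatch(flexible, " int var2 ; var2=123 ; ")
print(m.group(1))  # var2
```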

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
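
As a small illustration of named subpatterns, here is a sketch in Python. Python's <code>re</code> module supports only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the other syntaxes are not shown; the pattern and the group names <code>quote</code> and <code>body</code> are made up for this example.

```python
import re

# (?P<quote>...) names the subpattern, (?P=quote) refers back to it.
quoted = re.compile(r"(?P<quote>['\"])(?P<body>\w+)(?P=quote)")

m = quoted.fullmatch("'hello'")
print(m.group("quote"))  # '
print(m.group("body"))   # hello

# Opening and closing quotes must be the same character.
print(quoted.fullmatch("'hello\""))  # None
```

Adding another subpattern before <code>body</code> would change its number but not its name, which is the benefit described above.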

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

===Complex assertions===
Complex assertions test some part of the string without including it in the matched text, but they affect where a match can occur:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
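
The four assertions can be compared side by side. This sketch relies on Python's <code>re</code> module, which supports the same lookaround syntax; <code>first_match</code> is a hypothetical helper that returns the text of the first match found.

```python
import re

def first_match(regex, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(regex, text)
    return m.group(0) if m else None

print(first_match(r"a+(?=b)", "aaab"))   # aaa
print(first_match(r"a+(?!b)", "aaab"))   # aa
print(first_match(r"(?<=b)a+", "baaa"))  # aaa
print(first_match(r"(?<!b)a+", "baaa"))  # aa
```

In the second and fourth cases the engine gives up one "a" so that the assertion holds: "aa" is followed by "a" rather than "b", and the match starting at the second "a" is preceded by "a" rather than "b".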

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This project is free software, so it's hard to get any feedback. You shouldn't expect to get software which ideally suits your needs without telling anyone about these needs, or without giving some encouragement or easy support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) Preg question type may help a lot.

This software is considered a scientific project and such things could be really useful and appreciated:
* evidence that the results of our work (i.e. Preg question type) are really useful to people and were used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you use Preg and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or a journal article;
* cooperation in writing an article or help with publishing it in English-language journals (information and help with grants for further work is welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and has little free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper that describes your usage of the Preg question and that I could refer to would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-28T20:43:30Z

Oasychev: /* Anchoring */ more human-readable examples

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software which ideally suits your needs without telling anyone about these needs, or without giving some encouragement or easy support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible students answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
The '''case sensitivity''' setting applies to all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
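
The effect of this setting can be sketched in a few lines. The code below is an illustration in Python, not the question type's actual implementation: it assumes that "Yes" behaves like a whole-string match and "No" like a search anywhere in the response, and <code>response_matches</code> is a hypothetical name.

```python
import re

def response_matches(regex, response, exact_matching=True):
    """Emulate the Exact matching setting with Python's re module."""
    if exact_matching:
        return re.fullmatch(regex, response) is not None
    return re.search(regex, response) is not None

print(response_matches(r"whole", "the whole string", exact_matching=True))     # False
print(response_matches(r"whole", "the whole string", exact_matching=False))    # True
# Manual anchoring still forces a whole-response match.
print(response_matches(r"^whole$", "the whole string", exact_matching=False))  # False
```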

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. Precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not inside character classes (use \s instead) - so you may freely format your regexes with spaces. It will also ignore line breaks, with one useful addition: everything after a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to the hinting features. You can skip all that is said about regexes there, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
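
To give a feeling for the extended notation, here is a sketch using Python's <code>re.VERBOSE</code> flag, which behaves similarly (unescaped whitespace is ignored and "#" starts a comment). It is an analogy chosen for illustration, not Preg's own parser, and the pattern itself is a made-up example.

```python
import re

declaration = re.compile(r"""
    int \s+              # the type keyword and at least one space
    ([_a-zA-Z]\w*)       # subpattern 1: the variable name
    \s* = \s*            # the assignment sign, spaces are optional
    -? \d+               # an optionally negative integer
    \s* ;                # the closing semicolon
""", re.VERBOSE)

m = declaration.fullmatch("int count = -10;")
print(m.group(1))  # count
```

The spaces and line breaks used for layout do not take part in matching; a space in the response has to be requested explicitly with "\s".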

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which is quite a lot), but it may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts): it allows almost anything the DFA engine does, but the NFA engine is much more tested and stable.

===Hinting===
NFA and DFA matching engines support hinting in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and on which character the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to start at the start of the student's response. When only partial matching is used, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint: when the response is modified automatically, it is too easy to press '''hint''' repeatedly, which is usually not a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for the students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with 0%-49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
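
The selection rule can be expressed in a few lines. This is only a sketch of the rule as described here, not the question type's code; the function name <code>answers_used_for_hinting</code> and the representation of answers as (regex, grade) pairs are assumptions made for illustration.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the border.

    `answers` is a list of (regex, grade) pairs with grades from 0.0 to 1.0.
    """
    return [regex for regex, grade in answers if grade >= hint_grade_border]

answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]
print(answers_used_for_hinting(answers, 1.0))  # only the 100% answer
print(answers_used_for_hinting(answers, 0.5))  # both answers
```

With the border set to 1, the 60% answer about the forgotten colour is never used to build a hint, so the student is not led towards it.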

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it shows a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' shows the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, supports next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders differ between languages. For now it supports these languages (but there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' and the '''NFA''' engine support full subpattern capturing. The '''DFA''' engine can't do this, so you can use only the {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
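
The placeholder substitution can be imitated with Python's re module. This is a minimal sketch under stated assumptions, not the plugin's code: the helper name fill_placeholders is hypothetical, and leaving unknown placeholder numbers untouched is a choice made for this example.

```python
import re


def fill_placeholders(feedback, match):
    """Replace {$N} placeholders with the text captured by subpattern N."""
    def replace(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            # No such subpattern: keep the placeholder text (an assumption).
            return placeholder.group(0)
        # A subpattern that did not take part in the match gives an empty string.
        return match.group(number) or ""
    return re.sub(r"\{\$(\d+)\}", replace, feedback)


number_regex = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

full = fill_placeholders(feedback, number_regex.fullmatch("123.34"))
no_integral = fill_placeholders(feedback, number_regex.fullmatch(".5"))
```

For the response ".5" the words "Integral part is" stay in the text with nothing after them, which is the limitation described above.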

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that matches when something is absent from the response (and gives appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regex that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
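
The missing word regex can be tried out in Python, whose re module handles this particular pattern the same way. This is a sketch only; the helper name is_word_missing is hypothetical, and the Preg engines are not involved.

```python
import re

# Matches (with an empty match at the start) only when "necessary" is absent.
missing_necessary = re.compile(r"^(?!.*\bnecessary\b.*)")


def is_word_missing(response):
    """True when the response does not contain the whole word "necessary"."""
    return missing_necessary.search(response) is not None


absent = is_word_missing("it is important to rest")
present = is_word_missing("it is necessary to rest")
only_inside_other_word = is_word_missing("an unnecessary step")
```

The last call returns True because "\b" requires a word border on both sides, so "unnecessary" does not count as the word "necessary".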

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s+)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows one or more spaces between words. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by the start of the string and not followed by "am".
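
A rough check of the misplaced word idea in Python is shown below. It is a sketch under assumptions: negative lookbehind is written as "(?<!...)", and because Python's re module accepts only fixed-width lookbehind, a single "\s" is used there instead of "\s+". The helper name is_misplaced is hypothetical.

```python
import re

# "am" with no "I " right before it and no "victor" after it.
misplaced_am = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is not at the start of the string and is not followed by "am".
misplaced_i = re.compile(r"(?<!^)\bI\b(?!\s+am)")


def is_misplaced(regex, response):
    """True when the regex finds its word in a wrong place."""
    return regex.search(response) is not None


correct_am = is_misplaced(misplaced_am, "I am victor")
wrong_am = is_misplaced(misplaced_am, "am I victor")
correct_i = is_misplaced(misplaced_i, "I am victor")
wrong_i = is_misplaced(misplaced_i, "am I victor")
```

The search is rough by design: a response like "victor I am" is not caught by the first regex, because "am" still has "I" before it.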

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so it's not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regex

INSTALLATION NOTE. For the ''syntax tree'' and ''explaining graph'' tools to work, you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. The authoring tools code uses it to draw pictures for you.

===Regular expression area===
Here you can edit your regular expression. Clicking on "Update" sends the regex to all tools - so syntax tree, explaining graph, description and testing results are updated. "Save" closes the authoring tools form and saves the regex in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there. The last, "TODO" button, helps you to trace the interrelation between the regex itself (text representation) and the other representations: you can select a regex part and see where it is located in the syntax tree and in the graph.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of an expression graphically: what is inside what. It is most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown as the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO A green rectangle shows you the selected part of the expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in the database if you save the regex (they will be lost if you close the window with the "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), controlling their evaluation order with their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define a way to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all - just write some characters in sequence, and they'll be concatenated. But it still has precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. Alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.
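
The effect of escaping can be seen with a small Python experiment. It uses Python's re module as a stand-in for the question's engines, and the sample string "1+1=2" is chosen only for illustration. Here re.escape is a Python helper that adds the backslashes for you.

```python
import re

plain = "1+1=2"

# Unescaped, "+" is a quantifier: the regex means one or more "1", then "1=2".
unescaped_hits = [s for s in ["1+1=2", "11=2", "1111=2"] if re.fullmatch(plain, s)]

# Escaped, "+" is an ordinary character (an operand).
escaped_by_hand = r"1\+1=2"
escaped_by_library = re.escape(plain)

literal_match = re.fullmatch(escaped_by_hand, plain) is not None
library_match = re.fullmatch(escaped_by_library, plain) is not None
```

Without the backslash the regex matches "11=2" and "1111=2" but not the string "1+1=2" itself.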

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "many time" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that the space is inside the alternatives, so it is not required before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more times in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
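
The precedence rules listed above can be checked quickly in Python. The re module follows the same precedence for these operators, but it is not one of the Preg engines, and the helper name matches is hypothetical.

```python
import re


def matches(regex, text):
    """True when the whole text matches the regex (like exact matching)."""
    return re.fullmatch(regex, text) is not None


# Quantifier over concatenation: only the last character is repeated.
repeated_s = matches("many times*", "many timesss")
repeated_phrase = matches("(many times)*", "many timesmany times")
phrase_without_parens = matches("many times*", "many timesmany times")

# Quantifier over alternative: parentheses are needed to repeat the set.
repeated_d = matches("first|second*", "seconddd")
mixed_order = matches("(first|second)*", "firstsecondfirstfirst")

# The whole alternative is repeated, not one fixed choice from it.
any_pair = matches("(1|2){2}", "12")
same_pair_only = matches("1{2}|2{2}", "12")
```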

The internal structure of a regular expression can be seen well in the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower in the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^start|end$" will match "start" from the start of the string or "end" at the end of it;
* "^(start|end)$" uses parentheses to match exactly "start" or "end";
* "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).
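
These anchoring cases can be reproduced with Python's re.search, which looks for a match anywhere in the string unless the regex is anchored. This is a sketch with a hypothetical helper name found; it does not apply the question's '''exact matching''' option.

```python
import re


def found(regex, response):
    """True when the regex matches somewhere in the response."""
    return re.search(regex, response) is not None


# Each top-level alternative carries only its own anchor.
start_at_start = found("^start|end$", "start of the trip")
end_at_end = found("^start|end$", "the very end")
end_in_middle = found("^start|end$", "the end is near")

# Parentheses (or anchoring every alternative) give an exact match.
exact_rejects_longer = found("^(start|end)$", "start of the trip")
exact_accepts = found("^start$|^end$", "end")

# No anchors: the match can be anywhere inside the response.
unanchored = found("start", "a fresh start today")
```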

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ in each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#* '''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anyway); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and don't need to be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches where there is no word boundary, the negation of "\b".
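
Some of the operands from this list can be tried in Python. The sketch covers only the part of the syntax that Python's re module shares with Perl-compatible regexes: "\Q...\E", "\p{xx}" and POSIX classes are not supported by re, so they are left out. The helper name full is hypothetical.

```python
import re


def full(regex, text):
    """True when the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None


# Ranges mixed with a single character.
in_range = [c for c in "azDt5G" if full(r"[a-szC-F0-9]", c)]

# Negative character set; the last hyphen stands for itself.
negated = [c for c in "A-q!" if full(r"[^a-z-]", c)]

# Escaping inside a character set.
escaped_set = [c for c in ["-", "]", "\\", "a"] if full(r"[\-\]\\]", c)]

# Escape sequences for common character sets.
word_space_digit = full(r"\w\s\d", "a 1")

# Simple assertions consume no characters.
whole_word = re.search(r"\bcat\b", "a cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None
not_on_border = re.search(r"\Bcat\B", "concatenate") is not None
```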

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5".
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
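
A short Python experiment shows which numbers of repeats each quantifier form accepts. It relies on Python's re module, which reads these forms (including "{,2}") as described in the list above; the helper name matching_lengths is hypothetical.

```python
import re


def matching_lengths(regex, max_len=6):
    """Lengths n (0..max_len) for which "x" repeated n times matches fully."""
    return [n for n in range(max_len + 1) if re.fullmatch(regex, "x" * n)]


star = matching_lengths("x*")          # zero or more
plus = matching_lengths("x+")          # one or more
optional = matching_lengths("x?")      # zero or one
two_to_four = matching_lengths("x{2,4}")
two_or_more = matching_lengths("x{2,}")
up_to_two = matching_lengths("x{,2}")
exactly_two = matching_lengths("x{2}")
```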

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
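
Both backreference examples can be run in Python, whose re module uses the same "\1" syntax. This is only a sketch with Python standing in for the Preg engines.

```python
import re

# The backreference must repeat exactly what the subpattern captured.
same_twice = [s for s in ["aa", "bb", "ab", "ba"] if re.fullmatch(r"([ab])\1", s)]

# Declaration and initialization of the same variable, spaces allowed.
declaration = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

good = declaration.fullmatch(" int var2 ; var2=123 ; ")
captured_name = good.group(1)

# A different name in the assignment breaks the backreference.
other_variable = declaration.fullmatch("int a; b = 1;")
```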

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
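
Of the forms listed above, Python's re module accepts "(?P<name>...)" for a named subpattern and "(?P=name)" for a backreference to it, so those two are used in this sketch. The subpattern name "var" and the simplified regex are chosen only for illustration.

```python
import re

# The named subpattern "var" is referenced by name instead of by number.
declaration = re.compile(r"int (?P<var>\w+); (?P=var) = -?\d+;")

match = declaration.fullmatch("int _var; _var = -10;")
variable_name = match.group("var")

# The name keeps working even if more subpatterns are added before it.
extended = re.compile(r"(unsigned )?int (?P<var>\w+); (?P=var) = -?\d+;")
extended_name = extended.fullmatch("unsigned int n; n = 5;").group("var")

mismatch = declaration.fullmatch("int a; b = 1;")
```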

====Duplicate subpattern numbers and names====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

===Complex assertions===
Assertions test some part of the string without including it in the matched text, but they affect where a match can occur:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
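
The four assertions can be observed with Python's re module, which uses the same syntax for them. The sample strings are made up for illustration; the results show the text that actually gets matched.

```python
import re

# "b" is required after the run of "a" but is not part of the match.
positive_ahead = re.search(r"a+(?=b)", "caaab").group()

# The run is shortened until the next character is not "b".
negative_ahead = re.search(r"a+(?!b)", "aaab").group()

# Only the run that comes right after "b" is matched.
positive_behind = re.search(r"(?<=b)a+", "aabaaa").group()

# The "a" directly after "b" is skipped, the rest of the run is matched.
negative_behind = re.search(r"(?<!b)a+", "baaa").group()
```

Note the negative assertions: the engine is free to start or end the run of "a" at a different place to satisfy them, so the match can be shorter than the whole run.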

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
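
For comparison, this is how a regex parser can report such errors in Python. The sketch uses Python's re module, not the Preg parser, so the set of detected errors and the message wording differ; the helper name describe_error is hypothetical.

```python
import re


def describe_error(regex):
    """Return the parser's error message, or None if the regex compiles."""
    try:
        re.compile(regex)
    except re.error as error:
        return str(error)
    return None


# A few patterns from the error classes above that Python's parser also rejects.
broken = [
    "abc)",         # unopened closing parenthesis
    "(?:qwerty",    # unclosed opening parenthesis
    "+",            # quantifier without an operand
    "[a-fA-F\\d",   # unclosed character class
    "[z-a]",        # incorrect character set range
    "x{5,3}",       # incorrect quantifier range
    "ab\\",         # backslash at the end of the pattern
    "(abc)\\2",     # reference to a nonexistent subpattern
]

messages = {regex: describe_error(regex) for regex in broken}
```

Each message names the problem and usually its position in the regex, which is the kind of detail a plain "error or not" check cannot give.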

==The ways to give back==
This software is considered a scientific project and as such needs the help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey with your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article or help in publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, when doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that'll help, and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regex can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-28T20:42:07Z

Oasychev: /* Precedence and order of evaluation */ - made more human-readable examples

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or offering encouragement or some simple support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible students' answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
'''Case sensitivity''' sets the case sensitivity for all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
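The difference between the two '''Exact matching''' values can be sketched outside Moodle. The snippet below is only an illustration that uses Python's <code>re</code> module as a stand-in for the Preg matching engines; the function name <code>is_correct</code> is hypothetical and not part of the question type.

```python
import re

def is_correct(pattern: str, response: str, exact: bool) -> bool:
    """Illustrative check of a response under the two 'Exact matching' modes."""
    if exact:
        # "Yes": the entire response, first to last character, must match.
        return re.fullmatch(pattern, response) is not None
    # "No": it is enough that some part of the response matches.
    return re.search(pattern, response) is not None
```

With the pattern "whole", the response "the whole string" is accepted only when exact matching is off, while the response "whole" is accepted in both modes.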

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s to match whitespace) - so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything after a '#' character until the end of the line is treated as a comment (the '#' should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to the hinting features. You can skip everything said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
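To give a feel for the extended notation, here is a minimal sketch using Python's <code>re.VERBOSE</code> flag, which behaves much like the PHP 'x' modifier mentioned above. It is an analogy, not the Preg implementation, and the pattern and the name <code>DISTANCE</code> are made up for the example.

```python
import re

# Whitespace and "#" comments inside the pattern are ignored in verbose mode.
DISTANCE = re.compile(r"""
    [+\-]?      # optional sign
    [0-9]*      # optional integral part
    \.          # a literal dot (escaped)
    [0-9]+      # fractional part
    \s km       # \s matches a real space; the plain space before km is ignored
""", re.VERBOSE)
```

Here "-12.50 km" matches as a whole, while "12.50km" does not, because the only whitespace the pattern actually requires is the one written as \s.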

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless PHP developers are persuaded to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which is a lot), but it may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. if it does not reject regexes that this engine accepts); it allows almost anything the DFA engine does, but the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks off at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly facilitates thinking about hint, since when the response is modified it is too easy to repeatedly press '''hint''', which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
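The "shortest path" rule can be illustrated with a deliberately simplified sketch. It does not use an automaton as the NFA engine does: it assumes the regex has been expanded by hand into a finite list of complete answers and that the match starts at the beginning of the response. The names <code>next_char_hint</code> and <code>ANSWERS</code> are hypothetical.

```python
def next_char_hint(accepted_answers, response):
    """Return (matched_prefix, hint_char) for a response.

    accepted_answers is a hand-written list of full correct strings
    (an assumption of this sketch, not how Preg stores answers).
    """
    if not accepted_answers:
        raise ValueError("at least one accepted answer is required")
    best_key, best_prefix, best_hint = None, "", ""
    for answer in accepted_answers:
        # Length of the common prefix of this answer and the response.
        matched = 0
        while (matched < len(answer) and matched < len(response)
               and answer[matched] == response[matched]):
            matched += 1
        remaining = answer[matched:]
        # Prefer the longest partial match, then the shortest completion.
        key = (matched, -len(remaining))
        if best_key is None or key > best_key:
            best_key = key
            best_prefix = response[:matched]
            best_hint = remaining[:1]
    return best_prefix, best_hint

# The two strings described by: are blue, white(,| and) red
ANSWERS = ["are blue, white, red", "are blue, white and red"]
```

For the response "are blue, white; red" both answers share the prefix "are blue, white"; the completion ", red" is 3 characters shorter than " and red", so the sketch hints ','.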

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting, and those with 0%-49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
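The effect of the hint grade border amounts to a simple filter. The following sketch only restates the rule above in code; the <code>Answer</code> class and the function name are assumptions made for the illustration and do not mirror the Preg source.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    regex: str
    fraction: float  # grade as a fraction: 1.0 means 100%

def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only answers whose grade reaches the hint grade border."""
    return [a for a in answers if a.fraction >= hint_grade_border]

ALL_ANSWERS = [
    Answer("are blue, white(,| and) red", 1.0),
    Answer("are white(,| and) red", 0.6),
]
```

With a border of 1 only the first answer takes part in hinting; with a border of 0.5 both do.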

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button. Pressing it shows a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set hint '''penalty''' more than usual question '''penalty''', because they are applied separately: usual penalty for an attempt without hinting, while hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

Administrator of the site can control what languages are available to the teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. More precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to remove "Integral part is" in those circumstances - the placeholder syntax would become complex and prone to errors.
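The placeholder substitution described above can be imitated in a few lines. This is a rough sketch using Python's <code>re</code> module, not the code Preg runs; <code>fill_feedback</code> is a hypothetical helper name.

```python
import re

def fill_feedback(pattern, response, feedback):
    """Replace {$0}, {$1}, ... in feedback with captured subpatterns."""
    match = re.search(pattern, response)
    if match is None:
        return None  # the answer did not match, so no feedback is produced

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # Unknown subpattern number: leave the placeholder untouched.
            return placeholder.group(0)
        # A subpattern that did not take part in the match becomes "".
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
```

Calling <code>fill_feedback(NUMBER, "123.34", FEEDBACK)</code> reproduces the feedback line shown above, and for the response ".5" the {$1} placeholder is replaced by an empty string.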

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regex that should be missing with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
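The missing-word pattern can be tried out with any Perl-style regex library. The sketch below uses Python's <code>re</code> module purely as an illustration (recall that the NFA engine does not support lookaround assertions); <code>word_is_missing</code> is a hypothetical name.

```python
import re

# Anchored negative lookahead: succeeds only if "necessary" occurs nowhere.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response: str) -> bool:
    """True when the whole word 'necessary' does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None
```

Because of the \b word borders, a response containing only "unnecessary" still counts as missing the word "necessary".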

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words in the lookahead; a lookbehind must have a fixed length in Perl-compatible regexes, so it uses a single "\s". If you want to catch the first (last) word (punctuation mark, etc) - then you should place simple assertions for start/end of string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by the start of the string and not followed by "am".
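A small sketch of the misplaced-word check follows, written with the negative lookbehind spelled <code>(?<!...)</code> and a fixed-length body, which is the form Python's <code>re</code> module (used here only as a stand-in for a Perl-compatible engine) accepts. The function names are hypothetical.

```python
import re

# "am" that has neither "I " right before it nor "victor" after it.
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is not at the very start and is not followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def am_is_misplaced(response: str) -> bool:
    return MISPLACED_AM.search(response) is not None

def i_is_misplaced(response: str) -> bool:
    return MISPLACED_I.search(response) is not None
```

Both checks stay silent for "I am victor" and both fire for "am I victor".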

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting|hinting]]). You actually don't want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regexes

INSTALLATION NOTE. To have the ''syntax tree'' and ''explaining graph'' tools working you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. It is used by the authoring tools code to draw pictures for you.

===Regular expression area===
There you can enter (or edit) regular expression and refresh all the tools when done.

TODO You can also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of an expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all mastered arithmetic expressions since childhood, and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), minding their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all - just write some characters in sequence, and they'll be concatenated. But it still has precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. An alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), asterisk (zero or more times) and plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You could find a list and more examples of operands and operators in [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "many times*" matches "manytime" followed by zero or more "s";
#* "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "first|second|third" matches "first" or "second" or "third";
#* "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that space is in alternative to not require it before just "part");
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
#* "(first|second)*" matches "first" or "second", repeated zero or more time in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
#* "1{2}|2{2}" matches "11" or "22" only.
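The precedence examples in this section can be checked mechanically. The table below is a small sketch using Python's <code>re.fullmatch</code> as a stand-in for the question's matching engines with exact matching on; <code>whole_match</code>, <code>PRECEDENCE_EXAMPLES</code> and <code>check_examples</code> are names invented for the illustration.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
PRECEDENCE_EXAMPLES = [
    ("many times*", "many timesss", True),            # only "s" is repeated
    ("many times*", "many timesmany times", False),
    ("(many times)*", "many timesmany times", True),  # whole string repeated
    ("(first |second |)part", "part", True),          # empty alternative
    ("first|second*", "seconddd", True),              # only "d" is repeated
    ("first|second*", "firstfirst", False),
    ("(first|second)*", "firstsecondfirstfirst", True),
    ("(1|2){2}", "12", True),                         # alternative chosen anew each time
    ("1{2}|2{2}", "12", False),
    ("1{2}|2{2}", "22", True),
]

def check_examples() -> bool:
    return all(whole_match(p, t) == expected
               for p, t, expected in PRECEDENCE_EXAMPLES)
```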

The internal structure of a regular expression can be viewed well on the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower on the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the tree and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear enough to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
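The three anchoring variants can be compared side by side. This sketch uses Python's <code>re.search</code> to imitate a question with exact matching set to "No"; the helper name <code>found</code> is hypothetical.

```python
import re

def found(pattern: str, response: str) -> bool:
    """True if the pattern matches anywhere in the response."""
    return re.search(pattern, response) is not None

# Each alternative carries only its own anchor.
LOOSE = "^ab|cd$"
# Parentheses make both anchors apply to every alternative.
GROUPED = "^(ab|cd)$"
# Equivalent: every top-level alternative is anchored on both sides.
SPELLED_OUT = "^ab$|^cd$"
```

For example, "abxyz" and "xyzcd" are both accepted by the first pattern but rejected by the other two, which accept only "ab" or "cd".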

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping is '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters which could be used both inside or outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
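
The operands listed above can be tried out in a few lines of Python. This is only a sketch: it uses Python's standard <code>re</code> module as a stand-in for the Preg engines, and the sample strings are made up for illustration. Python's <code>re</code> does not support Unicode properties ("\p{xx}") or POSIX classes, so those are left out.

```python
import re

# Each entry: (pattern, text, expected list of non-overlapping matches).
examples = [
    (r"[a-szC-F0-9]", "tzD7y", ["z", "D", "7"]),   # ranges plus the single "z"
    (r"[^a-z-]", "a-B c", ["B", " "]),             # negative character set
    (r"[\-\]\\]", r"a-]\b", ["-", "]", "\\"]),     # escaping inside a set
    (r"\w+", "foo_1, bar", ["foo_1", "bar"]),      # word characters
    (r"\d", "a1b22", ["1", "2", "2"]),             # digits
    (r"\bcat\b", "cat concat cat!", ["cat", "cat"]),  # word boundaries
    (r"\Bcat", "cat concat", ["cat"]),             # only the "cat" inside "concat"
    (r"^\s*end$", "  end", ["  end"]),             # start and end assertions
]

# findall returns every non-overlapping match, scanning left to right.
results = [re.findall(pattern, text) for pattern, text, _ in examples]
```

Each pattern in <code>examples</code> yields exactly the list given as the third element of its entry.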

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
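
A short sketch of how these operators behave, again using Python's <code>re</code> module as a stand-in for the Preg engines. The helper name <code>accepts</code> is hypothetical and simply checks whether the whole string matches.

```python
import re

def accepts(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) -> expected result
checks = {
    ("a[0-9]", "a5"): True,      # concatenation
    ("ab|cd|de", "cd"): True,    # alternative
    ("(aa|bb|)c", "c"): True,    # empty alternative
    ("(aa|bb)c", "c"): False,
    ("x{2,4}", "xxx"): True,     # from 2 to 4 repetitions
    ("x{2,4}", "xxxxx"): False,
    ("ab*", "abbb"): True,       # the quantifier applies to "b" only
    ("ab*", "abab"): False,
    ("(ab)*", "abab"): True,     # parentheses make "ab" the operand
    ("(a|b){2}", "ba"): True,    # a repeated alternative
    ("(a|b){2}", "a"): False,
}
```

The pair "ab*" and "(ab)*" shows why precedence matters: a quantifier binds more tightly than concatenation, so parentheses are needed to repeat more than one character.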

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by counting opening parentheses, with 0 standing for the whole regex. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd" which is the 1st, "c" which is the 2nd and "d" which is the 3rd.
Subpatterns are usually used with '''backreferences''', which have numbers too. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for a backreference is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference must match the same character as the subpattern did.
Consider a small example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
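
The declaration example can be checked with a few lines of Python. This is a sketch using the standard <code>re</code> module, which accepts the same syntax for numbered subpatterns and backreferences; the function name <code>declared_name</code> is hypothetical.

```python
import re

# The space-tolerant regex from the text: subpattern 1 is the variable
# name, and \1 requires the same name to be repeated in the assignment.
DECL = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

def declared_name(code):
    """Return the variable name if code declares and initializes it, else None."""
    m = DECL.fullmatch(code)
    return m.group(1) if m else None

same_name = declared_name(" int var2 ; var2=123 ; ")   # names agree
other_name = declared_name("int a; b = 1;")             # names differ

# The simpler backreference example from the text.
pairs = [s for s in ("aa", "bb", "ab", "ba") if re.fullmatch(r"([ab])\1", s)]
```

Here <code>same_name</code> is "var2", <code>other_name</code> is <code>None</code> because the backreference cannot match a different name, and <code>pairs</code> keeps only "aa" and "bb".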

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
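
A minimal sketch of why names help, written for Python's <code>re</code> module. Note the assumption: of the spellings listed above, Python accepts only "(?P<name>...)" for a named subpattern and "(?P=name)" for a named backreference. The two patterns are illustrative variants of the declaration example.

```python
import re

# Original regex: the variable name is subpattern 1.
v1 = re.compile(r"int (?P<name>[_\w][_\w\d]*); (?P=name) = -?\d+;")
# Modified regex: a new subpattern was added in front, so the variable
# name is now subpattern 2, but its name did not change.
v2 = re.compile(r"(int|long) (?P<name>[_\w][_\w\d]*); (?P=name) = -?\d+;")

m1 = v1.fullmatch("int _var; _var = -10;")
m2 = v2.fullmatch("long _var; _var = -10;")

by_name = (m1.group("name"), m2.group("name"))   # same lookup for both versions
by_number = (m1.group(1), m2.group(1))           # numbers shifted after the edit
```

Looking up the subpattern by name gives "_var" for both versions, while number 1 refers to the variable name in the first regex and to the type keyword in the second.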

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both the 1st, "b" and "d" are both the 2nd.

===Complex assertions===
Complex assertions test some part of the string: the text they examine is not included in the match, but it affects whether a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
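
The four assertions can be observed with Python's <code>re</code> module, which uses the same syntax for them. This is a sketch; the helper name <code>first_match</code> and the sample strings are made up for illustration.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

ahead_pos = first_match(r"a+(?=b)", "aaab")    # "b" is checked but not consumed
ahead_neg = first_match(r"a+(?!b)", "aaab")    # must stop before the last "a"
behind_pos = first_match(r"(?<=b)a+", "baaa")  # "b" must precede the match
behind_neg = first_match(r"(?<!b)a+", "baaa")  # cannot start right after "b"
```

The results are "aaa", "aa", "aaa" and "aa" respectively: in no case is "b" part of the match, and the negative assertions force the match to shrink or to start later.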

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
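
For comparison, here is a sketch of the same two examples in Python's <code>re</code> module. Python spells a local modifier as a scoped group "(?i:...)" and recent versions reject a bare "(?i)" in the middle of a pattern, so the patterns below are equivalents of the Preg examples rather than the exact Preg syntax.

```python
import re

# Equivalent of "abc(de(?i)gh)xyz": only "gh" ignores case.
inner = re.compile(r"abc(de(?i:gh))xyz")
# Equivalent of "abc(de)(?i)ghxyz": "ghxyz" ignores case.
tail = re.compile(r"abc(de)(?i:ghxyz)")

inner_ok = inner.fullmatch("abcdeGHxyz") is not None    # case differs inside the scope
inner_bad = inner.fullmatch("abcdeghXYZ") is not None   # case differs outside the scope
tail_ok = tail.fullmatch("abcdeGhXyZ") is not None
tail_bad = tail.fullmatch("ABCdeghxyz") is not None
```

Only the responses whose case differs inside the case-insensitive part are accepted.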

===Error reporting===
The native PHP preg extension functions only report whether or not there is an error in a regular expression, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
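
To get a feeling for this kind of error reporting, you can feed a few of the broken patterns above to any regex parser that explains its errors. The sketch below uses Python's <code>re</code> module purely as an illustration: it is not the Preg parser, its messages are worded differently, and the helper name <code>compile_error</code> is hypothetical.

```python
import re

def compile_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

broken = [
    "abc)",         # unopened closing parenthesis
    "(?:qwerty",    # unclosed opening parenthesis
    "+",            # quantifier without an operand
    "[a-fA-F\\d",   # unclosed character class
    "[z-a]",        # incorrect character set range
    "a{5,3}",       # incorrect quantifier range
    "ab\\",         # backslash at the end of the pattern
    "(abc)\\2",     # reference to a nonexistent subpattern
]
messages = {pattern: compile_error(pattern) for pattern in broken}
```

Every pattern in <code>broken</code> produces a message, while a valid pattern such as "abc" produces <code>None</code>.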

==The ways to give back==
This software is considered a scientific project, and as such it needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey among your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help in publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-28T20:25:48Z

Oasychev: /* Regular expressions */

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using this documentation; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about those needs, and without giving the authors some encouragement or some simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about the hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, the authoring tools may help you a lot in creating your questions. The tools show you the meaning of your regex in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|how the questions work]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see if you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|how the questions work]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of the authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers
* '''Hinting''' - when students are stuck on the question, you may allow them to ask for the next correct word (lexem) or character (with a possible penalty)

===Settings affecting question work===
'''Case sensitivity''' sets the case sensitivity for all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
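
The difference between the two values of "Exact matching" corresponds to the difference between matching the whole string and searching inside it. Here is a minimal sketch in Python; the function <code>is_correct</code> is a hypothetical illustration, not the question type's actual code.

```python
import re

def is_correct(regex, response, exact_matching):
    """Check a response the way the 'Exact matching' setting describes."""
    if exact_matching:
        # The entire response must match the regex.
        return re.fullmatch(regex, response) is not None
    # It is enough that some part of the response matches.
    return re.search(regex, response) is not None

part_accepted = is_correct("whole", "the whole string", exact_matching=False)
part_rejected = is_correct("whole", "the whole string", exact_matching=True)
# Anchors in the regex force a whole-response match even without the setting.
anchored = is_correct("^whole$", "the whole string", exact_matching=False)
```

With the setting off, "the whole string" is accepted for the answer "whole"; with the setting on, or with an anchored regex, it is rejected.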

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s to match whitespace instead), so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a '#' character until the end of the line is treated as a comment (the '#' should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to the hinting features. You can skip everything that is said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
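
The extended notation is easiest to appreciate on an example. The sketch below uses Python's <code>re.VERBOSE</code> flag, which behaves much like the 'x' modifier described above; it illustrates the idea and is not the Preg implementation. The regex is a decimal number with an optional integral part.

```python
import re

# Whitespace and "#" comments are ignored, so the regex can be laid out
# and documented part by part.
NUMBER = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
""", re.VERBOSE)

# To match real whitespace, use \s or a character class.
SPACED = re.compile(r"a \s b [ ] c", re.VERBOSE)

parts = NUMBER.fullmatch("-12.5").groups()
with_spaces = SPACED.fullmatch("a b c") is not None
without_spaces = SPACED.fullmatch("abc") is not None
```

The formatted regex matches exactly what its one-line form "[+\-]?([0-9]+)?\.([0-9]+)" would match.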

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but it may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts); it allows almost anything the DFA engine does, but the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to begin at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
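
The idea of partial matching and of choosing the hint on the shortest path can be sketched in Python. This is a simplification under a stated assumption: instead of the automaton built by the NFA engine, the regex "are blue, white(,| and) red" is replaced by the explicit list of strings it accepts. All function names are hypothetical.

```python
# Assumption: the finite set of strings accepted by the example regex.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def partial_match(response, accepted=ACCEPTED):
    """Find the longest piece of the response that starts a correct answer.

    Returns (start index in the response, matched length)."""
    best = (0, 0)
    for start in range(len(response)):
        for answer in accepted:
            length = common_prefix_len(response[start:], answer)
            if length > best[1]:
                best = (start, length)
    return best

def next_char_hint(response, accepted=ACCEPTED):
    """Next character on the shortest way to complete the match, or None."""
    start, length = partial_match(response, accepted)
    matched = response[start:start + length]
    # Remaining text of every answer that continues the matched part.
    tails = [a[length:] for a in accepted
             if a.startswith(matched) and len(a) > length]
    if not tails:
        return None
    return min(tails, key=len)[0]
```

For the response "they are blue, vhite and red" the partial match starts at the word "are" and the hinted character is "w". For "are blue, white; red" both "," and " " could continue the match, and "," is chosen because the completion ", red" is shorter than " and red".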

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those below 50% will not. Regular expressions not used for hinting work only when they fully match the student's response.
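
The rule can be written down in a few lines of Python. This is only a sketch of the selection rule described above; the <code>Answer</code> class and the function name are hypothetical and do not reflect the plugin's internal data structures.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    regex: str
    grade: float  # fraction from 0.0 to 1.0

def answers_for_hinting(answers, hint_grade_border):
    """Keep only the answers allowed to take part in partial matching and hinting."""
    return [a for a in answers if a.grade >= hint_grade_border]

answers = [
    Answer("are blue, white(,| and) red", 1.0),
    Answer("are white(,| and) red", 0.6),   # the "forgot blue" answer
]
strict = answers_for_hinting(answers, 1.0)    # only the 100% answer
relaxed = answers_for_hinting(answers, 0.5)   # both answers
```

With the border set to 1, the 60% answer is never used to generate hints, so a student who forgot ''blue'' is not led towards the incomplete response.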

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the one next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white
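
A rough sketch of the idea in Python, assuming a very simple scanner in which a lexem is either a run of word characters or a run of punctuation marks. The real lexem borders come from the formal languages block; the names <code>LEXEM</code> and <code>next_lexem_hint</code> are hypothetical.

```python
import re

# Assumption: a lexem is a word/number or a group of punctuation marks;
# spaces only separate lexems.
LEXEM = re.compile(r"\w+|[^\w\s]+")

def next_lexem_hint(answer, matched_len):
    """Extend a partial match of matched_len characters to the end of a lexem.

    If the match ends inside a lexem, that lexem is completed;
    otherwise the next lexem is added."""
    for token in LEXEM.finditer(answer):
        if token.end() > matched_len:
            return answer[:token.end()]
    return answer

answer = "are blue, white, red"
after_are = next_lexem_hint(answer, 4)      # matched "are "
after_blue = next_lexem_hint(answer, 8)     # matched "are blue"
inside_white = next_lexem_hint(answer, 12)  # matched "are blue, wh"
```

The three calls give "are blue", "are blue," and "are blue, white", which correspond to the three hints shown above.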

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for format strings in the C/C++ programming language; you will probably have it disabled.

The site administrator can control which languages are available to the teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all the repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
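
The placeholder substitution described above can be imitated in Python. This is a sketch of the behaviour, not the question type's code: the function name <code>fill_feedback</code> is hypothetical, and treating a placeholder for a nonexistent subpattern as an empty string is an assumption.

```python
import re

def fill_feedback(regex, response, feedback):
    """Replace {$N} placeholders in the feedback with captured subpatterns."""
    m = re.search(regex, response)
    if m is None:
        return None

    def value(placeholder):
        number = int(placeholder.group(1))
        # Assumption: nonexistent subpatterns give an empty string.
        if number > m.re.groups:
            return ""
        # Subpatterns that did not take part in the match are empty too.
        return m.group(number) or ""

    return re.sub(r"\{\$(\d+)\}", value, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
full = fill_feedback(NUMBER, "123.34", FEEDBACK)
no_integral = fill_feedback(NUMBER, ".5", FEEDBACK)
```

For "123.34" this produces the feedback shown above; for ".5" the {$1} placeholder becomes an empty string while the words "Integral part is" stay in place.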

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that will match when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
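
The effect of the anchor can be seen in a short Python sketch. Python's <code>re</code> module supports the same assertion syntax; the sample responses and the function name <code>word_is_missing</code> are made up for illustration.

```python
import re

MISSING = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True when the word "necessary" does not occur in the response."""
    return MISSING.search(response) is not None

present = word_is_missing("It is necessary to check")
absent = word_is_missing("It is needed to check")
other_word = word_is_missing("An unnecessary step")   # not the word itself

# Without "^" the assertion succeeds somewhere after the start of the word,
# so the regex matches even though the word is present.
unanchored = re.search(r"(?!.*\bnecessary\b.*)", "It is necessary to check") is not None
```

Only the anchored regex reports the word as present in the first response; the unanchored one finds a match and would wrongly trigger the "missing word" feedback.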

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because PCRE does not accept an unbounded quantifier such as "+" inside a lookbehind assertion. If you want to catch the first (last) word (punctuation mark, etc), you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by the start of the string and not followed by "am".

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting|hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct response, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regexes

INSTALLATION NOTE. To have the ''syntax tree'' and ''explaining graph'' tools working, you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. It is used by the authoring tools code to draw pictures for you.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools when done.

TODO You can also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show subexpressions. When the expression is matched, it remembers which part of the string matched each subexpression. You can insert that part in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO A green rectangle shows the selected part of the expression.

===Description===
The description tries to formulate a sentence that describes how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in the database if you save the regex (they will be lost if you close the window with the "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by operator '''precedence'''. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?
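
The effect of precedence and parentheses on the expression tree can be checked with a short sketch. It uses Python's standard <code>ast</code> module purely as an illustration of arithmetic expression trees; it is not part of the Preg question type.

```python
import ast

# Parse the arithmetic expression from the text into a syntax tree.
tree = ast.parse("x+y*2", mode="eval").body
root_op = type(tree.op).__name__          # operator evaluated last (the root)
right_op = type(tree.right.op).__name__   # operator inside the right operand

# Parentheses change which operator ends up at the root.
grouped = ast.parse("(x+y)*2", mode="eval").body
grouped_root = type(grouped.op).__name__

print(root_op, right_op, grouped_root)
```

Without parentheses the root of the tree is the addition and the multiplication sits below it; with parentheses the multiplication becomes the root.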

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you arrange a sequence of '''operators''', each with the correct number of '''operands''' (arity), and control their evaluation order through '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. Their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches the single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters into a pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it has no character of its own - just write some characters in sequence, and they'll be concatenated. It still has precedence, though, which determines whether a quantifier repeats a single character or a sequence of them. The alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may also specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.
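
As a small illustration of escaping, here is a sketch using Python's standard <code>re</code> module (an assumption made for the example; the Preg question type itself runs in PHP). <code>re.escape</code> adds the backslashes for you.

```python
import re

literal = "a*b["
pattern = re.escape(literal)   # backslashes are added before '*' and '['

# The escaped pattern matches the literal text itself.
escaped_ok = re.fullmatch(pattern, literal) is not None

# Unescaped, '*' is a quantifier: "a*b" matches "aaab" but not the text "a*b".
quantifier_match = re.fullmatch("a*b", "aaab") is not None
literal_match = re.fullmatch("a*b", "a*b") is not None
manual_escape = re.fullmatch(r"a\*b", "a*b") is not None

print(pattern, escaped_ok, quantifier_match, literal_match, manual_escape)
```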

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit or underscore, '''<nowiki>[[:alpha:]]</nowiki>''' any letter, etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word boundary '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or just "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
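
The precedence rules above can be checked with a minimal sketch. It assumes Python's standard <code>re</code> module, which treats these particular constructs the same way; <code>matches</code> is a helper name invented for the example.

```python
import re

def matches(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    # quantifier over concatenation
    ("ab*", "abbb"): matches("ab*", "abbb"),
    ("ab*", "abab"): matches("ab*", "abab"),
    ("(ab)*", "abab"): matches("(ab)*", "abab"),
    # quantifier over alternative
    ("ab|cd*", "cddd"): matches("ab|cd*", "cddd"),
    ("(ab|cd)*", "ababcdabcdcd"): matches("(ab|cd)*", "ababcdabcdcd"),
    # repeating an alternative versus alternating repetitions
    ("(a|b){2}", "ab"): matches("(a|b){2}", "ab"),
    ("a{2}|b{2}", "ab"): matches("a{2}|b{2}", "ab"),
}

for (pattern, text), ok in results.items():
    print(f"{pattern!r} on {text!r}: {ok}")
```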

The internal structure of a regular expression is easy to see on the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower on the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear enough to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequentially, as alternatives, or by repetition.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
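
A short sketch of the three anchoring variants, assuming Python's standard <code>re</code> module as a stand-in for the question's matcher; <code>found</code> is a helper name made up for the example.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# "^ab|cd$" anchors each alternative separately.
loose_start = found("^ab|cd$", "abxyz")      # "ab" at the start is enough
loose_end = found("^ab|cd$", "xyzcd")        # "cd" at the end is enough

# Both remaining forms demand an exact "ab" or "cd".
strict_group = found("^(ab|cd)$", "abxyz")
strict_group_exact = found("^(ab|cd)$", "cd")
strict_alts = found("^ab$|^cd$", "xyzcd")
strict_alts_exact = found("^ab$|^cd$", "ab")

print(loose_start, loose_end, strict_group, strict_group_exact,
      strict_alts, strict_alts_exact)
```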

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anyway). Escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set, it shouldn't be placed first;
# '''Escape sequences''' for common character sets (can be used both inside and outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches anywhere except on a word boundary, the negation of "\b".
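
Some of the operands above can be tried out in a minimal sketch. It assumes Python's standard <code>re</code> module, which shares the character class and word boundary syntax shown here but, as far as we know, does not accept "\Q...\E", "\p{xx}" or POSIX classes, so those are left out. The helper <code>one_char</code> is invented for the example.

```python
import re

def one_char(pattern, ch):
    """True if the single character ch matches the whole pattern."""
    return re.fullmatch(pattern, ch) is not None

# Ranges mixed with a single character: 't' falls outside every range.
in_ranges = [c for c in "szDt5" if one_char("[a-szC-F0-9]", c)]

# Negative character set: lowercase letters and the hyphen are excluded.
negated = [c for c in "A-q" if one_char("[^a-z-]", c)]

# Escaping inside a character set: hyphen, closing bracket and backslash.
escaped = [c for c in r"-]\x" if one_char(r"[\-\]\\]", c)]

# Word boundaries: "cat" inside "concat" is not a separate word.
words = re.findall(r"\bcat\b", "cat concat cat.")

print(in_ranges, negated, escaped, words)
```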

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
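
To see what each quantifier accepts, the following sketch lists the repeat counts of "x" that match. It assumes Python's standard <code>re</code> module; <code>accepted_lengths</code> and the upper limit of 6 are choices made for the example.

```python
import re

def accepted_lengths(pattern, limit=6):
    """Lengths n (0..limit) for which "x" repeated n times matches the pattern."""
    return [n for n in range(limit + 1) if re.fullmatch(pattern, "x" * n)]

table = {p: accepted_lengths(p)
         for p in ["x*", "x+", "x?", "x{2,4}", "x{2,}", "x{2}"]}

for pattern, lengths in table.items():
    print(pattern, lengths)
```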

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered starting from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''', which have numbers too. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for a backreference is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
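
The backreference examples can be replayed with a minimal sketch, assuming Python's standard <code>re</code> module, which uses the same "\1" syntax. The regexes are copied from the text above; the variable names are made up for the example.

```python
import re

# The backreference must repeat exactly what the subpattern captured.
same_letter = [s for s in ["aa", "ab", "ba", "bb"]
               if re.fullmatch(r"([ab])\1", s)]

# Declaration and initialization of an integer variable in C.
decl = r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
good = re.fullmatch(decl, " int var2 ; var2=123 ; ")
bad = re.fullmatch(decl, " int var2 ; var3=123 ; ")   # different name after ';'

captured_name = good.group(1) if good else None
print(same_letter, captured_name, bad)
```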

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
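
A minimal sketch of a named subpattern with a named backreference. It assumes Python's standard <code>re</code> module, which accepts the "(?P<name>...)" and "(?P=name)" spellings; the other spellings listed above may not be accepted there. The name <code>letter</code> is chosen for the example.

```python
import re

# A named subpattern followed by a backreference to it by name.
pattern = re.compile(r"(?P<letter>[ab])(?P=letter)")

doubled = [s for s in ["aa", "ab", "ba", "bb"] if pattern.fullmatch(s)]
captured = pattern.fullmatch("bb").group("letter")

print(doubled, captured)
```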

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will use the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both the 1st, "b" and "d" are both the 2nd.

===Complex assertions===
Complex assertions test some part of the string without including it in the matched text, but they affect where a match can occur:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
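
The four assertions can be observed in a short sketch, assuming Python's standard <code>re</code> module, which uses the same lookaround syntax. The test strings are invented for the example.

```python
import re

# Positive lookahead: "b" must follow, but is not part of the match.
ahead = re.search(r"a+(?=b)", "caaab").group()

# Negative lookahead: the match is shortened until "b" no longer follows it.
not_ahead = re.search(r"a+(?!b)", "aaab").group()

# Positive lookbehind: only the "a" run right after "b" qualifies.
behind = re.search(r"(?<=b)a+", "aabaa")

# Negative lookbehind: the "a" directly after "b" cannot start the match.
not_behind = re.search(r"(?<!b)a+", "baa")

print(ahead, not_ahead, behind.group(), behind.start(), not_behind.group())
```

Note how the negative lookahead gives back one "a" so that the character after the match is no longer "b".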

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity setting, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
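
The same idea can be sketched in Python, with one caveat: Python's standard <code>re</code> module spells a local modifier as a scoped group "(?i:...)" rather than a bare "(?i)" in the middle of the pattern, so the patterns below are rewritten equivalents of the two examples above, not the Preg syntax itself.

```python
import re

# Equivalent of "abc(de(?i)gh)xyz": only "gh" ignores case.
inner = re.compile(r"abc(de(?i:gh))xyz")
inner_ok = inner.fullmatch("abcdeGhxyz") is not None
inner_bad = inner.fullmatch("abcDEghxyz") is not None   # "DE" is still case-sensitive

# Equivalent of "abc(de)(?i)ghxyz": everything after the group ignores case.
tail = re.compile(r"abc(de)(?i:ghxyz)")
tail_ok = tail.fullmatch("abcdeGHXYZ") is not None
tail_bad = tail.fullmatch("ABCdeghxyz") is not None     # "ABC" is still case-sensitive

print(inner_ok, inner_bad, tail_ok, tail_bad)
```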

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
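
To give a feeling for what parser-level error reporting looks like, here is a sketch that feeds a few of the broken patterns above to Python's standard <code>re</code> module. This only illustrates the idea; the messages come from Python's parser, not from the Preg parser, and <code>describe</code> is a helper invented for the example.

```python
import re

def describe(pattern):
    """Return the parser's error message for a broken pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as err:
        return f"{err.msg} (position {err.pos})"
    return None

broken = [
    "abc)",          # unopened closing parenthesis
    "(?:qwerty",     # unclosed opening parenthesis
    "+",             # quantifier without an operand
    "a(+)",          # quantifier at the start of a subexpression
    "[a-fA-F\\d",    # unclosed character class
    "[z-a]",         # incorrect character set range
    "x{5,3}",        # incorrect quantifier range
    "(abc)\\2",      # reference to a nonexistent subpattern
    "ab\\",          # backslash at end of pattern
]

reports = {p: describe(p) for p in broken}
for pattern, message in reports.items():
    print(f"{pattern!r}: {message}")
```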

==The ways to give back==
This software is considered a scientific project and as such needs the help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help with publishing it in English-language journals (information about and help with grants for further work is welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper that describes your usage of the Preg question, and that I could reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regex can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-28T20:23:15Z

Oasychev: /* Settings affecting question work */

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it at your discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or without giving the authors some encouragement or simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, the authoring tools may help you a lot in creating your questions. The tools show you the meaning of your regex in different ways: the internal structure of the expression (syntax tree), the visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to grasp the meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of the authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
The '''case sensitivity''' setting applies to all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You can still set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
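
The difference between the two values of the setting can be sketched as follows. This is only an illustration in Python with its standard <code>re</code> module; <code>is_correct</code> is a hypothetical helper, not a function of the question type.

```python
import re

def is_correct(regex, response, exact):
    """Exact matching: the whole response must match; otherwise a part is enough."""
    if exact:
        return re.fullmatch(regex, response) is not None
    return re.search(regex, response) is not None

part_is_enough = is_correct("whole", "the whole string", exact=False)
whole_required = is_correct("whole", "the whole string", exact=True)

# Manual anchors force a whole-response match even when exact matching is off.
anchored = is_correct("^whole$", "the whole string", exact=False)

print(part_is_enough, whole_required, anchored)
```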

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s instead) - so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to the hinting features. You can skip everything that is said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
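
The extended notation can be illustrated with Python's <code>re.VERBOSE</code> flag, which we assume behaves similarly (unescaped whitespace is ignored and '#' starts a comment). The regex is a simplified example written for this sketch, not one taken from the question type.

```python
import re

# The same regex written compactly and in a freely formatted, commented form.
compact = re.compile(r"\s*int\s+([_a-zA-Z]\w*)\s*=\s*-?\d+\s*;")
extended = re.compile(r"""
    \s* int \s+             # type keyword
    ([_a-zA-Z]\w*)          # variable name, captured as subpattern 1
    \s* = \s* -?\d+ \s* ;   # initial value and closing semicolon
""", re.VERBOSE)

text = " int x = -10;"
compact_name = compact.fullmatch(text).group(1)
extended_name = extended.fullmatch(text).group(1)

print(compact_name, extended_name)
```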

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but may still contain bugs in rare cases. Features unsupported for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts); it allows almost anything the DFA engine does, but the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and on which character the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact match" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type); it shows them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
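
The shortest-path rule can be imitated on a small scale. The sketch below is a simplification under loud assumptions: it is written in Python, it expands the example regex by hand into its two accepted strings instead of using an automaton as the NFA engine does, and it only handles responses matched from their first character. The function names are hypothetical.

```python
def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def next_character_hint(accepted, response):
    """Pick the next character leading to the shortest accepted string."""
    # Length of the longest partial match over all accepted strings.
    best = max(common_prefix_length(s, response) for s in accepted)
    # Accepted strings that continue this partial match.
    candidates = [s for s in accepted
                  if common_prefix_length(s, response) == best and len(s) > best]
    if not candidates:
        return None
    shortest = min(candidates, key=len)
    return shortest[best]


# Hand-expanded strings accepted by: are blue, white(,| and) red
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

print(repr(next_character_hint(ACCEPTED, "are blue, white; red")))
```

For the response "are blue, white; red" both strings continue the partial match, and the sketch picks ',' because the string that continues with ',' is the shorter one.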

It is possible that not all regular expressions give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% are used for hinting and those below 50% are not. Regular expressions not used for hinting work only when they fully match the student's response.
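
The rule amounts to a simple filter over the answers. Here is a minimal sketch in Python; the Answer class and the function name are hypothetical and do not come from the Preg code.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    regex: str
    fraction: float  # grade as a fraction: 1.0 means 100%


def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the hint grade border."""
    return [a for a in answers if a.fraction >= hint_grade_border]


ANSWERS = [
    Answer(r"are blue, white(,| and) red", 1.0),
    Answer(r"are white(,| and) red", 0.6),
]

for border in (1.0, 0.5):
    print(border, [a.regex for a in answers_used_for_hinting(ANSWERS, border)])
```

With the border at 1 only the full answer is used for hinting; with 0.5 the 60% answer is used as well.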

====Next character hinting====
When next character hinting is available, the student sees the '''hint next character''' button; pressing it gives a hint with the one next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For a natural language, a ''word'', a ''number'' or a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexems. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'' or an ''operator''. Note that spaces are usually not considered lexems but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' shows the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

The Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect the response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also the parts corresponding to each subpattern. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last of all the repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the value of the first subpattern, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in those circumstances - the placeholder syntax could become complex and error-prone.
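
The placeholder substitution can be sketched as follows. This is an illustration in Python under the assumption that placeholders are replaced by plain text substitution; the function fill_feedback is hypothetical and is not the code Preg uses.

```python
import re


def fill_feedback(pattern, response, feedback):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # No such subpattern: leave the placeholder untouched.
            return placeholder.group(0)
        # A subpattern that did not take part in the match gives "".
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```

The second call shows the situation described above: the text "Integral part is" stays in the feedback, followed by an empty value.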

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that matches when something is absent from the response (and gives appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even though the word is present).

If this description is difficult for you, just surround the regex that should be missing with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
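
The behaviour of this expression can be checked with a few lines of Python, used here only as a convenient Perl-style regex engine and not as part of Preg. The function name is hypothetical.

```python
import re

# Matches (at the start of the response) only when the word is absent.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def is_word_missing(response):
    return MISSING_NECESSARY.search(response) is not None


for response in ["it is necessary to go", "it is needed", "it is unnecessary"]:
    print(response, "->", is_word_missing(response))
```

Note that "unnecessary" does not count as the word "necessary", because \b requires a word boundary on both sides.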

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part, a negative lookbehind) and doesn't have "victor" after it (the "(?!\s+victor)" part, a negative lookahead). "\s+" in the lookahead allows any number of spaces between the words; the lookbehind uses a single "\s" because PCRE requires lookbehind assertions to have a fixed length. If you want to catch the first (last) word (punctuation mark, etc.), place a simple assertion for the start/end of the string ("^" or "$") instead of a word in the related assertion. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
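
A quick way to experiment with such expressions is the sketch below. It assumes Python's re module and writes the lookbehind in the standard (?<! ) form with a fixed length (a single \s), since Python, like PCRE, rejects variable-length lookbehind. The names are hypothetical.

```python
import re

# "am" that is not preceded by "I " and not followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is not at the start of the string and not followed by " am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")


def has_misplaced_am(response):
    return MISPLACED_AM.search(response) is not None


def has_misplaced_i(response):
    return MISPLACED_I.search(response) is not None


for response in ["I am victor", "am I victor", "I victor am"]:
    print(response, has_misplaced_am(response), has_misplaced_i(response))
```

One limitation of the fixed-length lookbehind: a response with two spaces between "I" and "am" is reported as misplaced, so this remains a rough search.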

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regexes

INSTALLATION NOTE. For the ''syntax tree'' and ''explaining graph'' tools to work you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. It is used by the authoring tools code to draw pictures for you.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools when done.

TODO You can also select a part of the regular expression and it will be mapped onto the tree, graph and description.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. It is most useful if you know how to read regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid-line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which speeds up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how your expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in the database if you save the regex (they will be lost if you close the window with the "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by the operators' '''precedence'''. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set out a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching strings using patterns. Their '''operands''' are characters, or sets of characters, that are allowed in a particular position. '''A''' is a regular expression that matches the single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters into a pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it has no character of its own: just write some characters in sequence and they'll be concatenated. It still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. Alternative is written as a vertical bar. There are many forms of quantifiers; the most commonly used are the question mark (repeat zero or one times), asterisk (zero or more times) and plus (one or more times). You may also specify the minimum and maximum number of repeats in curly braces, which is a quantifier too.

The special characters that define operators should be '''escaped''' (preceded by a backslash) when used as operands. Mathematical expressions never have escaping problems, since their operands (numbers, variables) are constructed from different characters than the operators (+, - etc.), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumerating characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit or underscore, <nowiki>[[:alpha:]]</nowiki> any letter, etc.). An important type of operand is the ''simple assertion'': these let you test some condition - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what that means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or just "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
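
The precedence rules listed above can be verified mechanically. The following sketch uses Python's re module, which follows the same precedence rules for these basic operators; the helper name full is made up for the example.

```python
import re


def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Quantifier over concatenation.
print(full("ab*", "abbb"), full("ab*", "abab"), full("(ab)*", "abab"))
# Concatenation over alternative.
print(full("ab|cd|de", "cd"), full("(aa|bb|)c", "c"))
# Quantifier over alternative.
print(full("ab|cd*", "cddd"), full("ab|cd*", "abab"), full("(ab|cd)*", "ababcdabcdcd"))
# A repeated alternative versus an alternative of repetitions.
print(full("(a|b){2}", "ab"), full("a{2}|b{2}", "ab"))
```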

The internal structure of a regular expression can be seen clearly on the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower in the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
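
The difference between these three expressions is easy to see in a short experiment. The sketch uses Python's re.search as an unanchored matcher; the helper name found is hypothetical.

```python
import re


def found(pattern, response):
    """True if the pattern matches anywhere the anchors allow."""
    return re.search(pattern, response) is not None


for pattern in ["ab|cd", "^ab|cd$", "^(ab|cd)$", "^ab$|^cd$"]:
    results = [found(pattern, r) for r in ["ab", "abxyz", "xyzcd", "xxabxx"]]
    print(pattern, results)
```

Only the last two expressions reject "abxyz" and "xyzcd"; the second one accepts them because each anchor belongs to just one of the alternatives.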

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this does not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#* '''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get a special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains the ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with "^", which means a '''negative character set''': it matches any character except those from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters, so it stands for itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\". Other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be the first character there.
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as Unicode properties (a complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between a word (\w) and a non-word (\W) character, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches where there is no word boundary, the negation of "\b".
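
Several of the operands above can be tried out with the following sketch. It is written for Python's re module, which is an assumption of this example: Python does not support the \Q ... \E sequence, so re.escape is used as its nearest stand-in, and Unicode properties and POSIX classes are left out because Python's re module lacks them. The names CASES, check and whole_words are made up.

```python
import re

# (pattern, text, should the whole text match?)
CASES = [
    (r"a\*b\[", "a*b[", True),        # escaped special characters
    (r"[a-szC-F0-9]", "z", True),     # the single character "z" in the class
    (r"[a-szC-F0-9]", "t", False),    # "t" is outside the range a-s
    (r"[^a-z-]", "-", False),         # the negative class excludes the hyphen
    (r"[^a-z-]", "A", True),
    (r"[\-\]\\]", "]", True),         # escaping inside a character class
    (r"\w\W\d\D", "a!1x", True),      # escape sequences for common sets
    (re.escape("^(abc)$") + ".", "^(abc)$!", True),  # stand-in for \Q...\E
]


def check(cases):
    """For each case, tell whether the engine agrees with the expectation."""
    return [(re.fullmatch(p, t) is not None) == expected
            for p, t, expected in cases]


def whole_words(word, text):
    """Find the word only where it has a word boundary on both sides."""
    return re.findall(r"\b" + re.escape(word) + r"\b", text)


print(check(CASES))
print(whole_words("cat", "cat concat cat."))
```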

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of something used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
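
To see which repeat counts each quantifier accepts, one can simply try them all. This is a small sketch in Python, whose re module happens to read all the forms listed above, including "{,2}", the same way; the function name is hypothetical.

```python
import re


def accepted_lengths(quantifier, longest=6):
    """Lengths n (0..longest) for which "x" repeated n times is matched."""
    pattern = re.compile("x" + quantifier)
    return [n for n in range(longest + 1) if pattern.fullmatch("x" * n)]


for quantifier in ["*", "+", "?", "{2,4}", "{2,}", "{,2}", "{2}"]:
    print(quantifier, accepted_lengths(quantifier))
```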

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses, so that "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd" which is the 1st, "c" which is the 2nd and "d" which is the 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc., so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
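
The declaration example can be run as is in any Perl-style engine. Below is a minimal sketch in Python; the names DECLARATION and declares_and_initializes are made up for the illustration.

```python
import re

# \1 must repeat exactly the variable name captured by the 1st subpattern.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")


def declares_and_initializes(code):
    return DECLARATION.fullmatch(code) is not None


for code in [" int var2 ; var2=123 ; ", "int _var; _var = -10;", "int a; b = 1;"]:
    print(repr(code), declares_and_initializes(code))
```

The last string is rejected because the name after the semicolon differs from the declared one.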

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
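
As an illustration, here is the declaration example rewritten with names. The sketch is in Python, whose re module accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above, so those are the ones used; the subpattern names and the function are hypothetical.

```python
import re

NAMED = re.compile(
    r"\s*int\s+(?P<variable>[_\w][_\w\d]*)\s*;"
    r"\s*(?P=variable)\s*=\s*(?P<value>-?\d+)\s*;\s*")


def parse_declaration(code):
    """Return the named subpatterns as a dictionary, or None if no match."""
    match = NAMED.fullmatch(code)
    return match.groupdict() if match else None


print(parse_declaration("int _var; _var = -10;"))
print(parse_declaration("int a; b = 1;"))
```

Adding another subpattern in front of "variable" would change its number but not its name, so the backreference keeps working.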

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will use the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both the 1st, "b" and "d" are both the 2nd.

===Complex assertions===
Complex assertions test some part of the string without including it in the matched text, but they affect whether the match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
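
The effect of each assertion on the matched text can be seen with the sketch below. It uses Python's re module for the demonstration (remember that Preg's NFA engine does not support lookaround assertions); the helper name first_match is made up.

```python
import re


def first_match(pattern, text):
    """Text of the leftmost match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


print(first_match(r"a+(?=b)", "aaab"))   # the "b" is checked but not included
print(first_match(r"a+(?!b)", "aaab"))   # the last "a" is given up: it precedes "b"
print(first_match(r"(?<=b)a+", "baaa"))
print(first_match(r"(?<!b)a+", "baaa"))  # the match cannot start right after "b"
```

The negative assertions show a subtle point: the engine shortens or shifts the match until the assertion holds, so "a+(?!b)" still finds "aa" inside "aaab".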

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
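
To get a feeling for what detailed error reporting looks like, here is a minimal sketch. It uses Python's standard <code>re</code> module only as a stand-in: it is not the parser used by the NFA and DFA engines, its messages differ from the ones Preg shows, and the helper name <code>describe_error</code> is made up for this illustration. The patterns are taken from the list above.

```python
import re

# Patterns from the list above. Python's re module is only a stand-in here,
# it is NOT the parser used by the NFA and DFA engines.
BROKEN_PATTERNS = [
    "abc)",        # unopened closing parenthesis
    "(?:qwerty",   # unclosed opening parenthesis
    "+",           # quantifier without an operand
    r"[a-fA-F\d",  # unclosed brackets of a character class
    "[z-a]",       # incorrect character set range
    "a{5,3}",      # incorrect quantifier range
    "ab\\",        # \ at end of pattern
]


def describe_error(pattern):
    """Return a short error description, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"{exc.msg} (position {exc.pos})"
    return None


for pattern in BROKEN_PATTERNS:
    print(repr(pattern), "->", describe_error(pattern))
```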

==The ways to give back==
This software is a scientific project, and as such it needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey of your teachers and/or students about it - but it can also include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help in publishing it in English-language journals (information and help with grants for further work is welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, during cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-28T20:22:39Z

Oasychev: /* Settings affecting question work */

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using this documentation, please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or giving the authors some encouragement or some easy support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, the authoring tools may help you a lot in creating your questions. The tools show you the meaning of your regex in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of the authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers
* '''Hinting''' - when students are stuck on a question, you may allow them to ask for the next correct word (lexem) or character (with a possible penalty)

===Settings affecting question work===
The case sensitivity setting applies to all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first to the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole regex" will be a correct student response

You still can set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
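
The difference between the two values can be sketched in a few lines. This is an illustration in Python's <code>re</code> module, not the question type's own code, and the function name <code>is_correct</code> is hypothetical: exact matching behaves like a full match, while non-exact matching behaves like a search anywhere in the response.

```python
import re


def is_correct(regex, response, exact_matching):
    """Mimic the 'Exact matching' setting with Python's re module."""
    if exact_matching:
        # The entire response must match the regex.
        return re.fullmatch(regex, response) is not None
    # It is enough for some part of the response to match.
    return re.search(regex, response) is not None


print(is_correct("whole", "the whole regex", exact_matching=False))    # True
print(is_correct("whole", "the whole regex", exact_matching=True))     # False
# Anchors inside the regex force a whole-response match even when the setting is "No".
print(is_correct("^whole$", "the whole regex", exact_matching=False))  # False
```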

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions. To be precise, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s instead) - so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to the hinting features. You can skip everything that is said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
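
The idea behind the extended notation can be tried out with Python's <code>re.VERBOSE</code> flag, which follows similar rules (whitespace is ignored, '#' starts a comment). This is only an analogy under that assumption: Preg parses the extended notation itself, and small details may differ. The regex is a simple pattern for a decimal number.

```python
import re

# The same regex written twice: freely formatted with comments, and compact.
EXTENDED = r"""
    [+\-]?      # optional sign
    ([0-9]+)?   # optional integral part
    \.          # decimal point
    ([0-9]+)    # fractional part
"""
COMPACT = r"[+\-]?([0-9]+)?\.([0-9]+)"

for response in ["123.34", "-.5", "12"]:
    extended = re.fullmatch(EXTENDED, response, re.VERBOSE) is not None
    compact = re.fullmatch(COMPACT, response) is not None
    print(response, extended, compact)
```

Both forms should accept and reject exactly the same responses; the extended one is just easier to read.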

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or when you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add partial matching support) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but it may still contain bugs in rare cases. Features unsupported for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject regexes that this engine accepts); it allows almost anything the DFA engine does, but the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and at which character the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact match" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
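
The idea of partial matching can be sketched with a toy model. The real NFA engine walks an automaton built from the regex; the sketch below instead assumes that the two strings accepted by the example expression have been written out by hand, and all names in it are hypothetical.

```python
# Toy model: the two full strings accepted by "are blue, white(,| and) red",
# written out by hand. The real engine works on the regex itself.
CORRECT_ANSWERS = ["are blue, white, red", "are blue, white and red"]


def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def partial_match(response, answers):
    """Return (start, length) of the longest partial match in the response."""
    best = (0, 0)
    for start in range(len(response)):
        for answer in answers:
            length = common_prefix_length(response[start:], answer)
            if length > best[1]:
                best = (start, length)
    return best


response = "they are blue, vhite and red"
start, length = partial_match(response, CORRECT_ANSWERS)
print(repr(response[start:start + length]))  # 'are blue, '
```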

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to their response (and possibly some more);
# it slightly encourages thinking about the hint: if the response were modified automatically, it would be too easy to repeatedly press '''hint''', which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
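
The shortest-path rule can be illustrated with the same kind of toy model: the strings accepted by the example expression are listed by hand (an assumption of this sketch, the real engine searches the automaton), and the hypothetical function <code>next_character_hint</code> picks the first character of the shortest remaining continuation.

```python
# The two full strings accepted by "are blue, white(,| and) red", listed by hand.
CORRECT_ANSWERS = ["are blue, white, red", "are blue, white and red"]


def next_character_hint(matched, answers):
    """Pick the next character on the shortest path to a complete match."""
    continuations = [
        answer[len(matched):]
        for answer in answers
        if answer.startswith(matched) and len(answer) > len(matched)
    ]
    if not continuations:
        return None  # nothing left to hint
    return min(continuations, key=len)[0]


# The partial match for "are blue, white; red" is "are blue, white".
print(repr(next_character_hint("are blue, white", CORRECT_ANSWERS)))  # ','
```

Here the continuation ", red" is 5 characters long and " and red" is 8, so ',' is chosen.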

Not all of your regular expressions have to give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with 0%-49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
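
As a small illustration of the rule, the selection of answers for hinting can be written as a simple filter. The data structure and the function name are assumptions made for this sketch, not the question type's internal representation.

```python
# (regex, grade) pairs for the two example answers; grades are fractions of 1.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]


def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


print(answers_used_for_hinting(ANSWERS, 1.0))  # only the 100% answer
print(answers_used_for_hinting(ANSWERS, 0.5))  # both answers
```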

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.
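
A rough idea of how a natural-language response splits into lexemes can be given with a one-line scanner. This is a simplified sketch in Python, not the scanner from the formal languages block; the pattern treats runs of letters and digits as one lexeme and runs of punctuation as another, and skips spaces.

```python
import re

# Words/numbers, or groups of punctuation marks; whitespace only separates lexemes.
LEXEME = re.compile(r"\w+|[^\w\s]+")


def split_into_lexemes(text):
    """Return the list of lexemes found in the text."""
    return LEXEME.findall(text)


print(split_into_lexemes("are blue, white and red?!"))
# ['are', 'blue', ',', 'white', 'and', 'red', '?!']
```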

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, supports next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : the C (or C++) programming language;
; ''printf language'' : a special language for format strings in the C/C++ programming language; you will probably have it disabled.

Administrator of the site can control what languages are available to the teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. To be precise, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
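
The placeholder substitution described above can be imitated in a few lines of Python. This is a sketch of the behaviour, not the question type's code: the function name <code>fill_feedback</code> is hypothetical, and replacing a placeholder for a subpattern number that doesn't exist with an empty string is an assumption of this sketch.

```python
import re

PLACEHOLDER = re.compile(r"\{\$(\d+)\}")


def fill_feedback(regex, response, feedback):
    """Replace {$0}, {$1}, ... in the feedback with the captured subpatterns."""
    match = re.fullmatch(regex, response)
    if match is None:
        return None  # this answer did not match, so its feedback is not shown

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return ""  # assumption: unknown subpattern numbers become empty
        # Subpatterns that did not take part in the match are None in Python.
        return match.group(number) or ""

    return PLACEHOLDER.sub(substitute, feedback)


REGEX = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(REGEX, "123.34", FEEDBACK))
print(fill_feedback(REGEX, ".5", FEEDBACK))
```

The second call shows the situation from the last paragraph: the text "Integral part is" stays, followed by an empty value.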

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
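
The behaviour of the missing-word expression can be checked quickly in Python, whose <code>re</code> module handles this particular lookahead the same way. The function name <code>word_is_missing</code> is made up for this sketch.

```python
import re

# Matches (with an empty match at the start) only when "necessary" is absent.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """True when the response does not contain the word 'necessary'."""
    return MISSING_NECESSARY.search(response) is not None


print(word_is_missing("this step is needed"))     # True
print(word_is_missing("this step is necessary"))  # False
print(word_is_missing("an unnecessary step"))     # True, \b requires a whole word
```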

You can also have a rough search for '''misplaced words''' (it will actually work only if anything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has no "I" before it (the "(?<!I\s)" part) and no "victor" after it (the "(?!\s+victor)" part). In the lookahead, "\s+" allows any number of spaces between the words; the lookbehind uses a single "\s" because PCRE, which the PHP preg extension is based on, accepts only fixed-length lookbehind assertions. If you want to catch the first (last) word (punctuation mark, etc) - then you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
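
A quick way to experiment with such expressions is sketched below in Python. Note the assumptions: a negative lookbehind is written "(?<!", and Python's <code>re</code> module (like PCRE) needs a fixed-width lookbehind, so this sketch uses a single "\s" there. The constant names are hypothetical.

```python
import re

# "am" that is neither preceded by "I " nor followed by "victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

for response in ["I am victor", "am I victor"]:
    print(
        repr(response),
        "misplaced am:", MISPLACED_AM.search(response) is not None,
        "misplaced I:", MISPLACED_I.search(response) is not None,
    )
```

For the correct sentence neither expression matches; for "am I victor" both do.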

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't describe a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regexes

INSTALLATION NOTE. To have the ''syntax tree'' and ''explaining graph'' tools working you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. It is used by the authoring tools code to draw pictures for you.

===Regular expression area===
There you can enter (or edit) regular expression and refresh all the tools when done.

TODO You can also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of an expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference within the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings, showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking care of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. Their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters into a pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character of its own - just write some characters in sequence, and they'll be concatenated. But it still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. An alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.
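
The effect of escaping is easy to see in a small experiment. The sketch below uses Python's <code>re</code> module as a stand-in (for these simple cases it follows the same escaping rules), and <code>re.escape</code> is Python's own helper, not a Preg feature.

```python
import re

literal = "a*b["

# Unescaped, "*" is an operator: "a*b" means zero or more "a" followed by "b".
print(re.fullmatch("a*b", "aaab") is not None)       # True

# Escaped, "*" and "[" are plain operands that match themselves.
print(re.fullmatch(r"a\*b\[", literal) is not None)  # True

# Python can add the backslashes for you.
escaped = re.escape(literal)
print(escaped)
print(re.fullmatch(escaped, literal) is not None)    # True

# The backslash itself has to be escaped as well.
print(re.fullmatch(r"\\", "\\") is not None)         # True
```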

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit or underscore, <nowiki>[[:alpha:]]</nowiki> any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or just "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more time in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
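
The examples from this list can be verified mechanically. The sketch below uses Python's <code>re.fullmatch</code> as a stand-in for exact matching (an assumption, since Preg has its own engines, but these basic operators behave the same way); the helper name <code>matches</code> is hypothetical.

```python
import re


def matches(regex, text):
    """True when the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None


CHECKS = [
    ("ab*", "abbb"),               # quantifier applies to "b" only
    ("ab*", "abab"),               # so "ab" is not repeated
    ("(ab)*", "abab"),             # parentheses repeat the whole "ab"
    ("ab|cd|de", "cd"),            # multi-character alternatives
    ("(aa|bb|)c", "c"),            # empty alternative
    ("ab|cd*", "cdddddd"),         # quantifier binds tighter than alternative
    ("ab|cd*", "abd"),
    ("(ab|cd)*", "ababcdabcdcd"),  # repeated alternative set
    ("(a|b){2}", "ab"),            # each repeat chooses again
    ("a{2}|b{2}", "ab"),
]

for regex, text in CHECKS:
    print(f"{regex!r} vs {text!r}: {matches(regex, text)}")
```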

The internal structure of a regular expression can be seen well in the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower on the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear to you. Remember that "execution" of a regular expression operator means linking its operands in the string: sequential linking, alternative linking, or repeating.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
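
The three anchoring variants can be compared with a short experiment. It is written in Python, whose <code>re.search</code> plays the role of non-exact matching here (an assumption of this sketch); the helper name <code>found</code> is hypothetical.

```python
import re


def found(regex, response):
    """True when the regex matches somewhere in the response."""
    return re.search(regex, response) is not None


print(found("^ab|cd$", "abxyz"))    # True: "ab" at the start
print(found("^ab|cd$", "xyzcd"))    # True: "cd" at the end
print(found("^ab|cd$", "xcdx"))     # False
print(found("^(ab|cd)$", "abxyz"))  # False: both anchors apply to both alternatives
print(found("^(ab|cd)$", "cd"))     # True
print(found("^ab$|^cd$", "cd"))     # True
```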

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anyway); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set, it shouldn't be the first character there.
# '''Escape sequences''' for common character sets (can be used both inside and outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters; they can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' are not characters but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between a word (\w) and a non-word (\W) character, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
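
A few of the operands listed above can be checked with a short sketch. It uses Python's <code>re</code> module, which is only assumed to behave like the Perl-compatible dialect for these particular constructs (it does not support "\Q...\E", "\p{xx}" or POSIX classes, so those are left out); the helper name <code>whole</code> is hypothetical.

```python
import re


def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Escaped special characters match themselves.
escaped_ok = whole(r"a\*b\[", "a*b[")

# Ranges "a-s", "C-F", "0-9" plus the single character "z".
in_ranges = [c for c in "asztCFG09" if whole(r"[a-szC-F0-9]", c)]

# Negative character set: anything except "a".."z" and "-".
negated = [c for c in "a-Z9" if whole(r"[^a-z-]", c)]

# "\b" is an assertion: it keeps "cat" inside "concat" from matching.
bounded = re.findall(r"\bcat\b", "cat concat cat.")

print(escaped_ok, in_ranges, negated, bounded)
```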

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be written. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
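
The difference between a quantifier applied to one character and to a parenthesised group can be seen in a small sketch. It uses Python's <code>re</code> module as an assumed stand-in for the Perl-compatible dialect; the helper name <code>whole</code> is hypothetical.

```python
import re
from itertools import product


def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# "(a|b){2}" repeats the alternative, so all four combinations match.
two_letters = ["".join(pair) for pair in product("ab", repeat=2)]
repeated_alternative = [s for s in two_letters if whole(r"(a|b){2}", s)]

# The empty alternative makes the prefix optional.
empty_alternative = [s for s in ["aac", "bbc", "c", "abc"] if whole(r"(aa|bb|)c", s)]

# A quantifier applies to the operand right before it:
# the group in "(ab)*", but only the "b" in "ab*".
samples = ["", "ab", "abab", "abb"]
grouped = [s for s in samples if whole(r"(ab)*", s)]
ungrouped = [s for s in samples if whole(r"ab*", s)]

print(repeated_alternative, empty_alternative, grouped, ungrouped)
```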

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by their opening parentheses, with 0 standing for the whole regex, so "(bc)" is the 1st subpattern. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd", which is the 1st, "c", which is the 2nd, and "d", which is the 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
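
The declaration regex can be exercised with a short sketch. It runs the same pattern through Python's <code>re</code> module (assumed to treat numbered backreferences like the Perl-compatible dialect); the names <code>DECLARATION</code> and <code>declared_variable</code> are hypothetical.

```python
import re

# The space-tolerant regex from the example above.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)


def declared_variable(response):
    """Return the variable name if the whole response matches, else None."""
    match = DECLARATION.fullmatch(response)
    return match.group(1) if match else None


print(declared_variable(" int var2 ; var2=123 ; "))
print(declared_variable("int _var; _var = -10;"))
# The backreference "\1" must repeat the declared name, so this one fails.
print(declared_variable("int a; b = 1;"))
```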

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns: the names stay the same.
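
As a rough illustration, here is the "([ab])\1" example rewritten with a named subpattern. Python's <code>re</code> module supports only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the sketch is limited to those; the subpattern name <code>letter</code> is made up for the example.

```python
import re

# Named subpattern plus a named backreference to it.
doubled = re.compile(r"(?P<letter>[ab])(?P=letter)")

results = {s: doubled.fullmatch(s) is not None for s in ["aa", "bb", "ab", "ba"]}
captured = doubled.fullmatch("bb").group("letter")

print(results, captured)
```

Adding another subpattern in front of this one would change its number, but "letter" would still refer to the same group.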

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

===Complex assertions===
Complex assertions test some part of the string without including it in the matched text, but they affect whether and where a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
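
What each assertion leaves in the match can be seen in a small sketch. It uses Python's <code>re</code> module, assumed to handle these four assertions the same way as the Perl-compatible dialect; the helper name <code>first_match</code> is hypothetical.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


positive_ahead = first_match(r"a+(?=b)", "aaab")   # "aaa": the "b" is tested, not consumed
negative_ahead = first_match(r"a+(?!b)", "aaab")   # "aa": the match backs off from the "b"
positive_behind = first_match(r"(?<=b)a+", "baa")  # "aa": the "b" before is tested only
negative_behind = first_match(r"(?<!b)a+", "baa")  # "a": the first "a" follows a "b"

print(positive_ahead, negative_ahead, positive_behind, negative_behind)
```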

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
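
To get a feeling for parser-level error reporting, here is a sketch that feeds a few of the broken patterns above to Python's <code>re</code> compiler. This is not the Preg parser: the messages come from Python and will differ from the ones Preg shows. The names <code>BROKEN</code> and <code>syntax_error</code> are hypothetical.

```python
import re

# A few broken patterns taken from the list above.
BROKEN = [
    r"abc)",       # unopened closing parenthesis
    r"(?:qwerty",  # unclosed opening parenthesis
    r"+",          # quantifier without an operand
    r"[a-fA-F\d",  # unclosed character class
    r"[z-a]",      # incorrect character set range
    r"x{5,3}",     # incorrect quantifier range
    r"(abc)\2",    # reference to a nonexistent subpattern
]


def syntax_error(pattern):
    """Return the compiler's error message, or None if the pattern is valid."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


report = {pattern: syntax_error(pattern) for pattern in BROKEN}
for pattern, message in report.items():
    print(pattern, "->", message)
```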

==The ways to give back==
This software is considered a scientific project and, as such, needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey with your teachers and/or students about it - but it can also include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help in publishing it in English-language journals (information and help with grants for further work are welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc.).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as a co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-28T20:17:34Z

Oasychev: /* How Preg questions work */ - fixed bold formatting and various errors

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or giving the authors some encouragement or some easy support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: the internal structure of the expression (syntax tree), the visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''Pattern matching''' - using regular expressions you can create powerful patterns describing possible students' answers
* '''Hinting''' - when students are stuck doing the question, you may allow them to ask for a next correct word (lexem) or a character (with possible penalty)

===Settings affecting question work===
The question's case sensitivity setting applies to all regular expressions you specify as answers. Note that you can also [[#Local case-sensitivity modifiers|set the case sensitivity for regex parts]].

'''Exact matching''' affects the question in the following way:
; ''Yes'' : the ''entire'' student's response, from the first until the last letter, should match your regular expression
; ''No'' : student's response can just contain a ''part'' that matches your regex: for example, if the correct answer is "whole" then "the whole regex" will be a correct student response

You can still set some of your regexes to match the whole student's response using [[#Anchoring|special regex syntax]].
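
The effect of the setting can be sketched in a few lines. The sketch below is an illustration only: it uses Python's <code>re</code> module, and the function name <code>is_correct</code> and its parameters are hypothetical, not part of the question type's code.

```python
import re


def is_correct(regex, response, exact_matching):
    """Check a response the way the 'Exact matching' setting is described."""
    if exact_matching:
        # "Yes": the entire response must match the regex.
        return re.fullmatch(regex, response) is not None
    # "No": a matching part anywhere in the response is enough.
    return re.search(regex, response) is not None


print(is_correct("whole", "the whole regex", exact_matching=False))
print(is_correct("whole", "the whole regex", exact_matching=True))
# With the setting at "No", a manually anchored regex is still exact.
print(is_correct("^whole$", "the whole regex", exact_matching=False))
```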

'''Notations''' specify the "language" of your answers.
; ''Regular expression'' : the usual notation for regular expressions; precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; ''Regular expression (extended)'' : useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s instead), so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
; ''Moodle shortanswer'' : use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to the hinting features. You can skip everything said about regexes here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
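
To show what the extended notation buys you, here is a sketch that writes one regex both ways. It relies on Python's <code>re.VERBOSE</code> flag, which is assumed to be close enough to the 'x' modifier for this purpose; the pattern (a decimal number with an optional integral part) and the names <code>COMPACT</code> and <code>EXTENDED</code> are chosen for illustration.

```python
import re

# Compact form: everything on one line.
COMPACT = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

# Extended form: unescaped whitespace is ignored and "#" starts a comment.
EXTENDED = re.compile(
    r"""
    [+\-]?      # optional sign
    ([0-9]+)?   # optional integral part
    \.          # a literal dot
    ([0-9]+)    # fractional part
    """,
    re.VERBOSE,
)

for response in ["-12.5", ".75", "1 2.5"]:
    compact = COMPACT.fullmatch(response)
    extended = EXTENDED.fullmatch(response)
    print(repr(response), compact is not None, extended is not None)
```

The spaces in the extended form are layout only, so "1 2.5" is rejected by both versions; a space that should be matched has to be written as "\s".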

'''Matching engine''' specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

; ''PHP preg extension'' : should be used when you '''don't need hinting''' and '''other engines reject your expressions''' as too difficult, or when you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting'''. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; ''Non-deterministic finite state automata (NFA)'' : can be used to '''perform hinting''' for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but it may still contain bugs in rare cases. Unsupported features for now are lookaround assertions, recursion and conditional subpatterns.
; ''Deterministic finite state automata (DFA)'' : WARNING - this engine has been unsupported for the past year. Use the NFA engine instead if you can (i.e. unless the NFA engine rejects regexes that this engine accepts): it allows almost anything the DFA engine does, but the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and that the match breaks at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match need not start at the start of the student's response. When just partial matching is used, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is not usually desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
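
The idea of partial matching and of choosing the hint on the shortest path can be illustrated with a deliberately simplified sketch. It does not walk an automaton as the real engines do: it assumes the regex has already been expanded into the two strings it can match, and the names <code>ANSWERS</code>, <code>partial_match</code> and <code>next_character</code> are all hypothetical.

```python
# The two strings matched by "are blue, white(,| and) red".
ANSWERS = ["are blue, white, red", "are blue, white and red"]


def common_prefix_length(text, answer):
    """Count how many leading characters of text and answer coincide."""
    length = 0
    while length < len(text) and length < len(answer) and text[length] == answer[length]:
        length += 1
    return length


def partial_match(response, answers=ANSWERS):
    """Return (start, length) of the longest answer prefix found in the response."""
    best = (0, 0)
    for start in range(len(response)):
        for answer in answers:
            length = common_prefix_length(response[start:], answer)
            if length > best[1]:
                best = (start, length)
    return best


def next_character(response, answers=ANSWERS):
    """Hint the next character on the shortest way to a complete match."""
    start, length = partial_match(response, answers)
    matched = response[start:start + length]
    candidates = [a for a in answers if a.startswith(matched) and len(a) > length]
    if not candidates:
        return None  # nothing left to hint
    return min(candidates, key=len)[length]


print(partial_match("they are blue, vhite and red"))
print(repr(next_character("they are blue, vhite and red")))
print(repr(next_character("are blue, white; red")))
```

For the second response both "," and " " would continue the match; the sketch picks "," because the answer it belongs to is shorter.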

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with 0%-49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
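
The rule boils down to a simple filter, sketched here with hypothetical names (<code>ANSWERS</code>, <code>answers_for_hinting</code>) and grades written as fractions of 1; this is an illustration, not the question type's actual code.

```python
# (regex, grade) pairs; grades are fractions, so 0.6 means 60%.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]


def answers_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


print(answers_for_hinting(ANSWERS, 1))    # only the 100% answer
print(answers_for_hinting(ANSWERS, 0.5))  # both answers
```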

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it shows a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For a natural language, a ''word'', a ''number'', a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexems. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

The Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect the response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
; ''simple english'' : the English language scanner recognizes words, numbers and punctuation marks;
; ''C/C++ language'' : a programming language C (or C++);
; ''printf language'' : a special language for formatting strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default one.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching, the engine remembers ('''captures''') not only the whole match but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all the repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in those circumstances - the placeholder syntax may become complex and prone to errors.
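
The placeholder substitution described above can be sketched as follows. This is an illustration in Python, not the question type's own code: the names <code>fill_feedback</code> and <code>feedback_for</code> are hypothetical, and leaving a placeholder with an unknown number untouched is an assumption of the sketch.

```python
import re

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"


def fill_feedback(template, match):
    """Replace {$N} placeholders with the text captured by subpattern N."""
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # assumption: unknown numbers stay as written
        return match.group(index) or ""  # an unmatched subpattern becomes an empty string
    return re.sub(r"\{\$(\d+)\}", substitute, template)


def feedback_for(response):
    """Return the filled-in feedback, or None if the response doesn't match."""
    match = NUMBER.fullmatch(response)
    return fill_feedback(FEEDBACK, match) if match else None


print(feedback_for("123.34"))
print(feedback_for(".5"))
```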

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent in the answer (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary", where matching is possible even if the word is present).

If the description is difficult for you, just surround the regex that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
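
The missing-word regex can be tried out with a short sketch. Note that lookaround assertions are listed as unsupported by the NFA engine; the sketch uses Python's <code>re</code> module as an assumed equivalent of the Perl-compatible behaviour, and the names <code>MISSING_NECESSARY</code> and <code>word_is_missing</code> are hypothetical.

```python
import re

# Matches (at the start of the response) only if "necessary" is absent.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """Return True if the word "necessary" does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None


print(word_is_missing("this step is optional"))
print(word_is_missing("this step is necessary"))
# "\b" keeps "unnecessary" from counting as the word "necessary".
print(word_is_missing("an unnecessary step"))
```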

You can also have a rough search for '''misplaced words''' (it will actually work only if anything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because PCRE lookbehind assertions generally must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc.), then you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
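
A rough sketch of the misplaced-word search in Python's <code>re</code> module follows. It assumes the standard negative lookbehind spelling "(?<!...)" and a fixed-length lookbehind (a single "\s"), since Python, like PCRE, does not accept "\s+" there; the names <code>MISPLACED_AM</code>, <code>MISPLACED_I</code> and <code>misplaced</code> are hypothetical.

```python
import re

# "am" that neither follows "I " nor is followed by "victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is not at the start of the response and not followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")


def misplaced(pattern, response):
    """Return True if the pattern finds a misplaced word in the response."""
    return pattern.search(response) is not None


for response in ["I am victor", "am I victor", "I victor am"]:
    print(response, misplaced(MISPLACED_AM, response), misplaced(MISPLACED_I, response))
```

As the text says, this is a rough search: it only recognises the word as misplaced when both of its neighbours are wrong.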

Note that if you have several answers to catch missing and misplaced things, only one of them will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want hints generated for these answers, since they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regexes

INSTALLATION NOTE. To have the ''syntax tree'' and ''explaining graph'' tools working you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. The authoring tools code uses it to draw pictures for you.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools when done.

TODO You can also select a part of the regular expression and it will be mapped onto the tree, graph and description.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of an expression graphically: what is inside what. It is most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, it remembers which part of the string matched each subexpression. You can insert that part in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO A green rectangle shows you the selected part of the expression.

===Description===
The description tries to formulate a sentence describing how your expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in the database if you save the regex (they will be lost if you close the window with the "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words, and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by the operators' '''precedence'''. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set out a sequence of '''operators''' with the correct number of '''operands''' for each (arity), minding their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.
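To see operators, operands, precedence and arity at work on the arithmetic examples above, here is a small sketch that prints the operator tree of an expression. It relies on Python's own expression parser (the standard <code>ast</code> module), which is an assumption of this illustration and has nothing to do with the Preg code; the function name <code>describe</code> is hypothetical.

```python
import ast

def describe(expr: str) -> str:
    """Render the operator tree of an arithmetic expression in prefix form."""
    def walk(node: ast.AST) -> str:
        if isinstance(node, ast.BinOp):      # binary operator: two operands
            return f"{type(node.op).__name__}({walk(node.left)}, {walk(node.right)})"
        if isinstance(node, ast.UnaryOp):    # unary operator: one operand
            return f"{type(node.op).__name__}({walk(node.operand)})"
        if isinstance(node, ast.Name):       # a variable used as an operand
            return node.id
        if isinstance(node, ast.Constant):   # a number used as an operand
            return repr(node.value)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(describe("x+y*2"))    # Add(x, Mult(y, 2)): '*' is evaluated first
print(describe("(x+y)*2"))  # Mult(Add(x, y), 2): parentheses change the order
print(describe("-x"))       # USub(x): the unary minus takes one operand
```

The operator evaluated last ends up as the outermost one, just as the root of a syntax tree.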

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. Their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches the single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative, and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character of its own - just write some characters in sequence, and they'll be concatenated. But it still has precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. An alternative is written as a vertical bar. There are many forms of quantifiers; the most commonly used are the question mark (repeat zero or one times), asterisk (zero or more times) and plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one position. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit or underscore, <nowiki>[[:alpha:]]</nowiki> any letter, etc). Another important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - the start of the string '''^''', the end of the string '''$''' or a word boundary '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what that means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or just "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
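The precedence rules above can be checked with a few lines of Python. This is only a sketch: it uses Python's <code>re</code> module, which follows the same precedence rules for these operators, as a stand-in for the Preg engines, and the helper name <code>exact</code> is hypothetical.

```python
import re

def exact(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: without parentheses only "b" is repeated.
print(exact(r"ab*", "abbb"), exact(r"ab*", "abab"), exact(r"(ab)*", "abab"))

# Quantifier over alternative: parentheses are needed to repeat the whole set.
print(exact(r"ab|cd*", "cdddddd"), exact(r"(ab|cd)*", "ababcdabcdcd"))

# A repeated alternative is not a repetition of one chosen branch.
print(exact(r"(a|b){2}", "ab"), exact(r"a{2}|b{2}", "ab"))
```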

The internal structure of a regular expression can be seen well in the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower in the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear to you. Remember that "execution" of regular expression operators means linking parts of the string: sequential linking, alternative linking, or repetition.

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with '''^''', the match should start at the start of the student's response;
* if a regular expression ends with '''$''', the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
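The three anchoring variants behave as follows when searched for inside a response. The sketch uses Python's <code>re.search</code> as a stand-in for a non-exact Preg match (an assumption of this illustration); the helper name <code>found</code> is hypothetical.

```python
import re

def found(pattern: str, response: str) -> bool:
    """Return True when the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None

for response in ["ab", "cd", "abxyz", "xyzcd", "xabcdx"]:
    print(
        response,
        found(r"^ab|cd$", response),    # "ab" at the start OR "cd" at the end
        found(r"^(ab|cd)$", response),  # exactly "ab" or exactly "cd"
        found(r"^ab$|^cd$", response),  # the same, each alternative anchored
    )
```

Only the first variant accepts "abxyz" and "xyzcd", because each anchor there belongs to a single alternative.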

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anyway); escape it as "\." if you need to match a literal dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be the first character there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters; they can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches anywhere that is not a word boundary, the opposite of "\b".
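A few of the operands listed above can be tried in Python. This is a sketch under the assumption that Python's <code>re</code> module treats these particular operands like the Preg engines do; it deliberately leaves out \Q...\E, Unicode properties and POSIX classes, which Python's <code>re</code> does not support. The helper name <code>exact</code> is hypothetical.

```python
import re

def exact(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Escaped special characters match themselves.
print(exact(r"\|", "|"), exact(r"a\*b\[", "a*b["), exact(r"\\", "\\"))

# Ranges "a-s", "C-F", "0-9" plus the single character "z".
print([c for c in "asztCFG09" if exact(r"[a-szC-F0-9]", c)])

# Negative character set: anything except "a" to "z" and "-".
print(exact(r"[^a-z-]", "A"), exact(r"[^a-z-]", "-"))

# "\b" does not consume characters; it only tests for a word boundary.
print(re.findall(r"\bcat\b", "cat concat cat."))
```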

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be written. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
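The repetition counts accepted by the common quantifiers can be listed with a short sketch. It assumes Python's <code>re</code> module as a stand-in for the Preg engines; the function name <code>accepted_lengths</code> is hypothetical, and the "{,2}" form is left out because its handling differs between regex implementations.

```python
import re

def accepted_lengths(pattern: str, max_len: int = 5) -> list:
    """Return the repeat counts n (0..max_len) for which "x" * n fully matches."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "x" * n)]

for pattern in ["x*", "x+", "x?", "x{2,4}", "x{2,}", "x{2}"]:
    print(pattern, accepted_lengths(pattern))
```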

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains the subpattern "bc". Subpatterns are numbered by their opening parentheses, with 0 reserved for the whole regex. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd" which is the 1st, "c" which is the 2nd and "d" which is the 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for a backreference is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
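Both declaration regexes can be tried as they are written above. The sketch uses Python's <code>re</code> module as a stand-in for the Preg engines (an assumption of this illustration); the names <code>SIMPLE</code>, <code>FLEXIBLE</code> and <code>declared_name</code> are hypothetical.

```python
import re

SIMPLE = re.compile(r"int ([_\w][_\w\d]*); \1 = -?\d+;")
FLEXIBLE = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

def declared_name(regex, code):
    """Return the text of subpattern 1 if the whole string matches, else None."""
    match = regex.fullmatch(code)
    return match.group(1) if match else None

print(declared_name(SIMPLE, "int _var; _var = -10;"))      # strict spacing
print(declared_name(FLEXIBLE, " int var2 ; var2=123 ; "))  # free spacing
print(declared_name(FLEXIBLE, "int a; b = 1;"))            # "\1" must repeat "a"
```

The last call returns nothing because the backreference requires the assigned variable to be the declared one.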

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
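Here is a sketch of a named subpattern with a named backreference. It uses Python's <code>re</code> module, which accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above; the other forms are PCRE syntax and are not shown. The names <code>DOUBLED</code> and <code>letter</code> are hypothetical.

```python
import re

# The named version of "([ab])\1": the same letter must appear twice.
DOUBLED = re.compile(r"(?P<letter>[ab])(?P=letter)")

match = DOUBLED.fullmatch("bb")
print(match.group("letter"))    # the subpattern is available by name...
print(match.group(1))           # ...and still by number
print(DOUBLED.fullmatch("ab"))  # no match: the backreference needs "a" again
```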

====Duplicate subpattern numbers and names====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

===Complex assertions===
Complex assertions test some part of the string without including it in the matched text, but they affect whether the match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b" without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" that are not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" that are not preceded by "b".
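The four assertions can be compared on short strings. The sketch uses Python's <code>re</code> module as a stand-in for the Preg engines (an assumption of this illustration); the helper name <code>first_match</code> is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match found in the string, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(first_match(r"a+(?=b)", "aaab"))   # "aaa": the "b" is checked, not consumed
print(first_match(r"a+(?!b)", "aaab"))   # "aa": the last "a" is followed by "b"
print(first_match(r"(?<=b)a+", "baaa"))  # "aaa": the match starts right after "b"
print(first_match(r"(?<!b)a+", "baaa"))  # "aa": the first "a" is preceded by "b"
```

Note how the negative assertions do not make the whole match fail; the engine simply finds a shorter or later match that satisfies them.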

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This software is considered a scientific project and as such needs the help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and were used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey with your teachers and/or students about it - but it can include co-authoring conference theses or a journal article;
* cooperation in writing an article or help in publishing it in English-language journals (information and help with grants for further work is welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, when doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has much to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regex can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexeme hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-23T20:18:47Z

Oasychev: /* Precedence and order of evaluation */ - added info about syntax tree and order of evaluation

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. [[#Ways to use Preg questions and this docs|The first section]] should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals; I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or without giving the authors encouragement or some simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your Shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot in creating your questions. The tools show you the meaning of your regex in different ways: the internal structure of the expression (syntax tree), the visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|how the questions work]] to better understand the various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|how the questions work]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of the authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''pattern matching''' - using regular expressions you can create powerful patterns describing possible students' answers;
* '''hinting''' - when your students are stuck doing the question, you may allow them to ask for the next correct word (lexeme) or character (with a penalty if you wish so).

===Settings that affect how the question works===
====Case sensitivity====
You should know this setting from the core Shortanswer question type. Note, however, that you can [[#Local case-sensitivity modifiers|change case sensitivity inside your regular expressions]], making only parts of them case sensitive.

====Exact matching====
'''Matching''' means finding a part of the student's answer that suits the regular expression (or your answer). This part is called the '''match'''. Traditionally, regular expressions were used to look for matches '''inside''' strings, i.e. '''all''' of the ''regular expression'' should match, but it could match only a '''part of''' the ''student's response''.

; '''Yes''' : the entire student's response should match the regular expression.
; '''No''' : any part of the student's response could match the regular expression. You could still make some of your regexes match the whole student's response using [[#Anchoring|regular expression features]].
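
The difference between the two values can be tried outside Moodle. The snippet below is only an illustrative sketch in Python's <code>re</code> module (an assumption made for the example, since Preg itself runs on PHP with its own engines): <code>re.fullmatch</code> plays the role of "Yes" and <code>re.search</code> the role of "No".

```python
import re

PATTERN = r"are blue, white(,| and) red"
RESPONSE = "they are blue, white and red"

# "Exact matching: Yes" - the whole response has to match.
exact = re.fullmatch(PATTERN, RESPONSE)

# "Exact matching: No" - a match anywhere inside the response is enough.
inside = re.search(PATTERN, RESPONSE)

print(exact)            # None: the leading "they " is not covered by the regex
print(inside.group(0))  # the part of the response that matched
```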

====Notations====
Notation is the way you write your regexes. Alternatively, choose the "Moodle shortanswer" notation to avoid regexes altogether and still use the hinting features.
; '''Regular expression''' : This is the usual notation for regular expressions. Precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; '''Regular expression (extended)''' : This notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead), so that you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from a # character (unescaped and not part of a character class) to the end of that line is treated as a comment.
; '''Moodle shortanswer''' : Choose this notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
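
To get a feeling for the extended notation, here is a minimal sketch using Python's <code>re.VERBOSE</code> flag, which follows the same idea (ignored whitespace, <code>#</code> comments). Python is used only for illustration; the Preg notation itself is configured in the question form, and the two dialects may differ in details.

```python
import re

# Extended-style pattern: unescaped whitespace and "#" comments are ignored.
EXTENDED = r"""
    [+\-]?      # optional sign
    ([0-9]+)?   # optional integral part
    \.          # decimal point
    ([0-9]+)    # fractional part
"""

# The same regex written in the usual, compact notation.
COMPACT = r"[+\-]?([0-9]+)?\.([0-9]+)"

verbose_match = re.fullmatch(EXTENDED, "-12.5", re.VERBOSE)
compact_match = re.fullmatch(COMPACT, "-12.5")

# Both notations capture the same subpatterns.
print(verbose_match.groups() == compact_match.groups())
```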

====Matching engines====
A matching engine is the program code that executes regular expressions. There is no 'best' matching engine - the choice depends on the features you want to use and on the regular expressions the engine has to handle. The engines differ in stability and in the features they offer.

; '''PHP preg extension''' : Use it when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or you encounter bugs in them. It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.
; '''Non-deterministic finite state automata (NFA)''' : Use the NFA engine to '''perform hinting''' for your students if it can handle your regular expressions. The NFA engine is custom PHP code that uses finite automata to perform matching. It supports many (but not all) regular expression features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but may still contain bugs in rare cases. Features not supported for now include complex assertions, recursion and conditional subpatterns.
; '''Deterministic finite state automata (DFA)''' : WARNING: This engine has not been maintained over the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts): it allows almost anything the DFA engine can, and it is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks off at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact match" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it encourages thinking about the hint: if the response were modified automatically, it would be too easy to press '''hint''' repeatedly, which is usually not a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
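
The choice described above can be sketched in a few lines of Python. This is only a toy model under loud assumptions: it lists the two complete answers of the example regex by hand instead of running an automaton, and the function names <code>partial_match</code> and <code>next_character_hint</code> are hypothetical, not part of Preg.

```python
# Complete answers accepted by: are blue, white(,| and) red
ACCEPTED = ["are blue, white, red", "are blue, white and red"]


def partial_match(response, accepted=ACCEPTED):
    """Return (matched_text, remaining_text) for the best partial match.

    Longer matches win; among equally long matches, the one with the
    shortest remaining path to a complete answer wins.
    """
    best_key, best = None, ("", "")
    for start in range(len(response)):
        for answer in accepted:
            n = 0
            while (start + n < len(response) and n < len(answer)
                   and response[start + n] == answer[n]):
                n += 1
            key = (-n, len(answer) - n)
            if n > 0 and (best_key is None or key < best_key):
                best_key, best = key, (answer[:n], answer[n:])
    return best


def next_character_hint(response):
    """Return the first character of the shortest completion ('' if none)."""
    _matched, remaining = partial_match(response)
    return remaining[:1]


print(next_character_hint("are blue, white; red"))          # ","
print(next_character_hint("they are blue, vhite and red"))  # "w"
```

For the response with the semicolon, both alternatives match equally far, so the sketch prefers the completion <code>, red</code> (5 characters) over <code> and red</code> (8 characters).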

Not every regular expression has to give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they have a full match with the student's response.
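
The rule amounts to a simple filter over the answers. The sketch below is an illustration only; the data layout and the function name <code>answers_used_for_hinting</code> are assumptions made for the example, not Preg internals.

```python
# (regular expression, grade) pairs; grades are fractions of the full mark.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]


def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


# Border 1: only the 100% answer takes part in partial matching and hinting.
print(answers_used_for_hinting(ANSWERS, 1.0))
# Border 0.5: the 60% answer is used for hinting as well.
print(answers_used_for_hinting(ANSWERS, 0.5))
```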

====Next character hinting====
When next character hinting is available, the student gets a '''hint next character''' button. Pressing it shows a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem), for example:
are blue
or
are blue,
or
are blue, white
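
A rough sketch of how such a hint could be computed is given below in Python. It is an assumption-laden illustration: the tiny regex tokenizer stands in for a real language scanner from the formal languages block, and <code>next_lexem_hint</code> is a hypothetical name.

```python
import re

# Toy "simple english" scanner: a lexem is a run of word characters
# or a run of punctuation; spaces only separate lexems.
LEXEM = re.compile(r"\w+|[^\w\s]+")


def next_lexem_hint(answer, matched_length):
    """Extend a partial match of `matched_length` characters of `answer`
    to the end of the current lexem, or to the end of the next one."""
    for lexem in LEXEM.finditer(answer):
        if lexem.end() > matched_length:
            return answer[:lexem.end()]
    return answer


ANSWER = "are blue, white, red"
print(next_lexem_hint(ANSWER, len("are b")))      # are blue
print(next_lexem_hint(ANSWER, len("are blue")))   # are blue,
print(next_lexem_hint(ANSWER, len("are blue,")))  # are blue, white
```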

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
* '''simple english''' - the English language scanner recognizes words, numbers and punctuation marks;
* '''C/C++ language''' - the C (or C++) programming language;
* '''printf language''' - a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have a different number of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax might become complex and prone to errors.
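
The substitution can be imitated with a short Python sketch. It only mirrors the behaviour described above with Python's <code>re</code> module; <code>fill_feedback</code> is a hypothetical helper, not the question type's actual code.

```python
import re

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
PLACEHOLDER = re.compile(r"\{\$(\d+)\}")


def fill_feedback(template, response):
    """Replace {$n} placeholders with what subpattern n captured."""
    match = NUMBER.search(response)
    if match is None:
        return None

    def captured(placeholder):
        index = int(placeholder.group(1))
        # A subpattern that took no part in the match yields an empty string.
        return match.group(index) or ""

    return PLACEHOLDER.sub(captured, template)


FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(FEEDBACK, "123.34"))
```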

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
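
The effect of the anchor can be checked with a small Python sketch (Python's <code>re</code> is used here only as a convenient stand-in for a Perl-compatible engine; <code>word_is_missing</code> is a hypothetical helper name).

```python
import re

MISSING = re.compile(r"^(?!.*\bnecessary\b.*)")
# The same assertion without the anchor, to show why "^" is needed.
UNANCHORED = re.compile(r"(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """True when the anchored negative lookahead finds a match."""
    return MISSING.search(response) is not None


print(word_is_missing("this step is optional"))   # True
print(word_is_missing("this step is necessary"))  # False

# Without "^" a match is found at a position after the word has started,
# even though the word is present in the response.
print(UNANCHORED.search("this step is necessary") is not None)  # True
```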

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s+)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words. If you want to catch the first (last) word (punctuation mark, etc.), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note that if you have several answers to catch missing and misplaced things, only one of them will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and other answers that only give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You actually don't want to generate hints for these answers, as they don't define a correct situation, so this is a feature, not a problem.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regexes

INSTALLATION NOTE. To have the ''syntax tree'' and ''explaining graph'' tools working you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. It is used by the authoring tools code to draw pictures for you.

===Regular expression area===
There you can enter (or edit) regular expression and refresh all the tools when done.

TODO You can also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of an expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You could insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings, showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.
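
If you like to see such trees for yourself, the arithmetic examples above can be inspected with Python's standard <code>ast</code> module. This is just an illustration of precedence and arity in a familiar setting; it has nothing to do with how Preg parses regexes.

```python
import ast

# x + y * 2: the root is "+", and its right operand is the "*" subtree,
# because "*" has precedence and is evaluated first.
tree = ast.parse("x + y * 2", mode="eval").body
print(type(tree.op).__name__)        # Add
print(type(tree.right.op).__name__)  # Mult

# (x + y) * 2: parentheses change the order, so "*" becomes the root.
grouped = ast.parse("(x + y) * 2", mode="eval").body
print(type(grouped.op).__name__)     # Mult
print(type(grouped.left.op).__name__)  # Add

# -x: the unary minus takes a single operand.
unary = ast.parse("-x", mode="eval").body
print(type(unary).__name__)          # UnaryOp
```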

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters in the pattern: sequence (the ''concatenation'' operator), alternative, and repetition (called a ''quantifier''). Concatenation is such a simple operator that it has no character of its own: just write some characters in sequence, and they'll be concatenated. But it still has a precedence, so that the question can tell whether you want to repeat a single character or a sequence of them. The alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.
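
The difference between an escaped and an unescaped special character can be tried in a quick Python sketch. Python's <code>re.escape</code> is used here only as a convenient way to add the backslashes; inside Preg you type the backslashes yourself.

```python
import re

literal = "a*b["

# re.escape puts a backslash before the characters that would otherwise
# act as operators, so the result matches the text literally.
escaped = re.escape(literal)
print(re.fullmatch(escaped, literal) is not None)  # True

# Unescaped, "*" is a quantifier: "a*b" means zero or more "a", then "b".
print(re.fullmatch(r"a*b", "aaab") is not None)    # True
print(re.fullmatch(r"a*b", "a*b") is not None)     # False

# Escaped, "\*" is an operand that matches the asterisk itself.
print(re.fullmatch(r"a\*b", "a*b") is not None)    # True
```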

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, <nowiki>[[:alpha:]]</nowiki> any letter etc). An important type of operand is the ''simple assertion'': simple assertions allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word border '''\b'''.

You could find a list and more examples of operands and operators in [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's see what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or just "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.

The internal structure of a regular expression can be viewed well in the [[#Syntax tree|syntax tree]] (authoring tool). The operators that are executed first are placed lower in the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear enough to you. Remember that "execution" of regular expression operators means linking their operands in the string: sequential linking, alternative linking, or repeating.
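
The precedence examples above can also be checked mechanically. The sketch below uses Python's <code>re.fullmatch</code> (the whole string must match); for these simple operators Python's dialect behaves the same way as the Perl-compatible one, but <code>full</code> is just a helper made up for this example.

```python
import re


def full(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Quantifier over concatenation.
print(full("ab*", "abbb"))     # True: "a" followed by several "b"
print(full("ab*", "abab"))     # False: only "b" is repeated
print(full("(ab)*", "abab"))   # True: parentheses repeat the whole "ab"

# Concatenation over alternative, quantifier over alternative.
print(full("ab|cd*", "cddd"))  # True
print(full("ab|cd*", "abcd"))  # False
print(full("(ab|cd)*", "ababcdabcdcd"))  # True

# A quantifier repeats the whole alternative, not one fixed choice.
print(full("(a|b){2}", "ab"))   # True
print(full("a{2}|b{2}", "ab"))  # False
```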

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
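
The three anchoring variants can be compared with a short Python sketch (again only an illustration with Python's <code>re.search</code>, which looks for a match anywhere in the string; <code>found</code> is a hypothetical helper).

```python
import re


def found(pattern, text):
    """Return the matched part of text, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# "^" belongs only to "ab" and "$" only to "cd".
print(found(r"^ab|cd$", "abxyz"))    # ab
print(found(r"^ab|cd$", "xyzcd"))    # cd

# Parentheses make both anchors apply to the whole alternative.
print(found(r"^(ab|cd)$", "abxyz"))  # None
print(found(r"^(ab|cd)$", "cd"))     # cd

# Anchoring every top-level alternative gives the same exact match.
print(found(r"^ab$|^cd$", "xyzcd"))  # None
```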

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
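
Some of the operands above can be tried in the following Python sketch. Note the assumption: Python's <code>re</code> is a different dialect from the Perl-compatible one used by Preg, so only the parts common to both are shown (no <code>\Q...\E</code>, Unicode properties or POSIX classes). The helper <code>matches</code> is made up for this example.

```python
import re


def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Ranges mixed with a single character.
print(matches(r"[a-szC-F0-9]", "z"))  # True
print(matches(r"[a-szC-F0-9]", "t"))  # False: "t" is outside "a-s"

# Negative character set; the trailing hyphen stands for itself.
print(matches(r"[^a-z-]", "A"))       # True
print(matches(r"[^a-z-]", "-"))       # False

# Escaping inside a character set.
print(matches(r"[\-\]\\]", "]"))      # True
print(matches(r"[\-\]\\]", "\\"))     # True

# "\w" covers letters, digits and the underscore.
print(matches(r"\w", "_"))            # True

# "\b" consumes nothing; it only checks for a word boundary.
print(re.findall(r"\bcat\b", "cat concat cat"))  # ['cat', 'cat']
```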

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered starting from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
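
Both backreference examples can be run as a Python sketch. The numbered backreference syntax "\1" is the same in Python's <code>re</code>, which is used here only for illustration; the constant names are made up for this example.

```python
import re

# A backreference must repeat exactly what the subpattern captured.
DOUBLED = re.compile(r"([ab])\1")
print(DOUBLED.fullmatch("aa") is not None)  # True
print(DOUBLED.fullmatch("ab") is not None)  # False

# Declaration and initialization of the same C integer variable.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

good = DECLARATION.fullmatch(" int var2 ; var2=123 ; ")
print(good.group(1))  # var2: the name captured by the 1st subpattern

# A different name in the assignment breaks the backreference.
print(DECLARATION.fullmatch("int a; b = 1;"))  # None
```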

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
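
A named version of the "([ab])\1" example is sketched below in Python. It uses the "(?P<name>...)" and "(?P=name)" spellings from the list above, since those are the ones Python's <code>re</code> is known to accept; the name <code>letter</code> is chosen just for this example.

```python
import re

# The named subpattern "letter" and a backreference to it by name.
NAMED = re.compile(r"(?P<letter>[ab])(?P=letter)")

match = NAMED.fullmatch("bb")
print(match.group("letter"))  # b: captured text, looked up by name
print(match.group(1))         # b: the same subpattern still has number 1

print(NAMED.fullmatch("ab"))  # None: the backreference must repeat "a"
```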

====Duplicate subpattern numbers and names====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

===Complex assertions===
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
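
The four assertions above behave the same way in Python's <code>re</code> module, so a small sketch can show what ends up in the match. The helper <code>first_match</code> is an illustrative name, not a Preg function.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Positive lookahead: "b" must follow, but is not part of the match.
print(first_match(r"a+(?=b)", "aaab"))
# Negative lookahead: the engine backs off until "b" does not follow.
print(first_match(r"a+(?!b)", "aaab"))
# Positive lookbehind: "b" must come right before the match.
print(first_match(r"(?<=b)a+", "baaa"))
# Negative lookbehind: the match cannot start right after "b".
print(first_match(r"(?<!b)a+", "baaa"))
```

Note how the negative assertions do not simply fail: the engine looks for another place or a shorter repetition where the condition holds.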

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity setting, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
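
The effect of local modifiers can be sketched in Python with one caveat: Python's <code>re</code> treats a bare "(?i)" as a flag for the whole pattern and recent versions reject it in the middle of an expression. The sketch therefore uses the scoped form "(?i:...)", which Perl-compatible syntax also accepts, to mark the same case-insensitive parts as in the two examples above. The helper name <code>accepts</code> is an assumption of this sketch.

```python
import re

# Only "gh" is case-insensitive, as in "abc(de(?i)gh)xyz".
inner = re.compile(r"abc(de(?i:gh))xyz")

# Everything after "(de)" is case-insensitive, as in "abc(de)(?i)ghxyz".
tail = re.compile(r"abc(de)(?i:ghxyz)")


def accepts(regex, text):
    """True if the whole text matches the compiled regex."""
    return regex.fullmatch(text) is not None


print(accepts(inner, "abcdeGHxyz"), accepts(inner, "abcdeghXYZ"))
print(accepts(tail, "abcdeGhXyZ"), accepts(tail, "abcDEghxyz"))
```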

===Error reporting===
Native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This software is considered a scientific project and as such needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey with your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article or help in publishing it in English-language journals (information about and help with grants for further work is welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, when doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-23T18:03:57Z

Oasychev: /* Understanding regular expressions */ - written an introduction about operands and operator

{{Questions}}Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give great power and flexibility both to teachers when making questions and to students when writing answers to them. The [[#Ways to use Preg questions and this docs|first section]] should guide you in using these docs; please use it at your discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
A little foreword: regardless of the way you use Preg and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or giving some encouragement or simple support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regular expressions, but want to use pattern matching===
If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see the corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How Preg questions work|question working]] to better understand various settings and how they affect your questions.

===I can make some effort to learn regular expressions well and be able to do anything they allow===
Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding [[#Understanding regular expressions|this section]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regexes well, read the sections about [[#How Preg questions work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities, don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How Preg questions work|question working]] to understand various settings and question behaviour under them. You also may be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, [[#Regular expressions reference|regular expression reference]] may be of some use for you.

==How Preg questions work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):
* '''pattern matching''' - using regular expressions you can create powerful patterns describing possible students' answers;
* '''hinting''' - when your students are stuck on the question, you may allow them to ask for the next correct word (lexem) or character (with a penalty if you wish).

===Settings that affect how the question will work===
====Case sensitivity====
You should know this setting from the core Shortanswer question type. Note, however, that you can [[#Local case-sensitivity modifiers|change case sensitivity inside your regular expressions]], making only parts of them case sensitive.

====Exact matching====
'''Matching''' means finding a part of the student's answer that suits the regular expression (or your answer). This part is called the '''match'''. Traditionally, regular expressions were used to look for matches '''inside''' strings, i.e. '''all''' of the ''regular expression'' should match, but it could match with a '''part of''' the ''student's response''.

; '''Yes''' : the entire student's response should match the regular expression.
; '''No''' : any part of the student's response could match the regular expression. You could still make some of your regexes match the whole student's response using [[#Anchoring|regular expression features]].
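
The difference between the two values can be sketched in Python, where <code>re.fullmatch</code> plays the role of exact matching and <code>re.search</code> looks for a match anywhere inside the response. The function <code>response_matches</code> and its <code>exact</code> parameter are names invented for this sketch; they only mirror the setting and are not Preg code.

```python
import re

ANSWER = r"are blue, white(,| and) red"


def response_matches(regex, response, exact):
    """Mimic the 'Exact matching' setting: Yes -> whole response, No -> any part."""
    if exact:
        return re.fullmatch(regex, response) is not None
    return re.search(regex, response) is not None


response = "they are blue, white and red"
print(response_matches(ANSWER, response, exact=False))
print(response_matches(ANSWER, response, exact=True))
```

With exact matching off, the extra word "they" is tolerated; with exact matching on, the same response is rejected.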

====Notations====
Notation is the way you write your regexes. Alternatively, choose the "Moodle shortanswer" notation to avoid regexes altogether and still use the hinting features.
; '''Regular expression''' : This is the usual notation for regular expressions. Precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; '''Regular expression (extended)''' : This notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so that you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
; '''Moodle shortanswer''' : Choose this notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to hinting. You can skip all that is said on the regular expression topic there, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
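
To give a feeling for the extended notation, here is a sketch using Python's <code>re.VERBOSE</code> flag, which follows similar rules: unescaped whitespace outside character classes is ignored and "#" starts a comment that runs to the end of the line. This is an analogy only; the exact rules of the Preg notation are the ones described above.

```python
import re

# A decimal number with an optional integral part, formatted freely.
number = re.compile(
    r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
    """,
    re.VERBOSE,
)

# Whitespace that must be matched is escaped or put in a character class.
spaced = re.compile(r"a\ b [ ]c", re.VERBOSE)

print(number.fullmatch("-12.5").groups())
print(spaced.fullmatch("a b c") is not None)
```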

====Matching engines====
A matching engine is the program code that executes regular expressions. There is no 'best' matching engine: the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.

; '''PHP preg extension''' : Use it when you '''don't need hinting''' and '''other engines reject your expressions''' as too difficult, or you encounter bugs in them. It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, is very stable and thoroughly tested. But PHP functions don't support partial matching, so (unless we persuade PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.
; '''Non-deterministic finite state automata (NFA)''' : Use the NFA engine to '''perform hinting''' for your students if it can handle your regular expressions. The NFA engine is custom PHP code that uses finite automata to perform matching. It allows many (but not all) regular expression features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but may still contain bugs in rare cases. Features not supported for now include complex assertions, recursion and conditional subpatterns.
; '''Deterministic finite state automata (DFA)''' : WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts): it allows almost anything the DFA engine can, and the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
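
The idea of partial matching and choosing the shortest completion can be sketched in Python under a strong simplifying assumption: instead of a real regex engine, the sketch spells out the two strings that the expression "are blue, white(,| and) red" can match and compares the response with them. All function names here are hypothetical; the actual NFA engine works on automata, not on a list of strings.

```python
# The two strings matched by "are blue, white(,| and) red".
ANSWERS = ["are blue, white, red", "are blue, white and red"]


def common_prefix_length(text, answer):
    """Number of leading characters that text and answer share."""
    length = 0
    limit = min(len(text), len(answer))
    while length < limit and text[length] == answer[length]:
        length += 1
    return length


def best_partial_match(response, answers):
    """Return (start, length, answer) for the best partial match.

    Longer matches win; ties go to the answer that needs fewer
    characters to be completed (the shortest path).
    """
    candidates = []
    for answer in answers:
        for start in range(max(len(response), 1)):
            length = common_prefix_length(response[start:], answer)
            remaining = len(answer) - length
            candidates.append((-length, remaining, start, answer))
    neg_length, _remaining, start, answer = min(candidates)
    return start, -neg_length, answer


def next_character_hint(response, answers):
    """Return the next correct character, or '' if the match is complete."""
    _start, length, answer = best_partial_match(response, answers)
    return answer[length] if length < len(answer) else ""


print(repr(next_character_hint("they are blue, vhite and red", ANSWERS)))
print(repr(next_character_hint("are blue, white; red", ANSWERS)))
```

For the second response both answers share the same matched part, and the one ending in ", red" wins because it needs three characters fewer than the " and red" path.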

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% are used for hinting and those with grades below 50% are not. Regular expressions not used for hinting work only when they fully match the student's response.
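
The selection rule can be written down as a short Python sketch. The list of answers and the function name <code>answers_used_for_hinting</code> are assumptions made for the illustration; grades are given as fractions between 0 and 1.

```python
# (regular expression, grade) pairs from the examples above.
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]


def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only answers whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


print(answers_used_for_hinting(ANSWERS, 1.0))
print(answers_used_for_hinting(ANSWERS, 0.5))
```

With a border of 1 the 60% answer about the forgotten ''blue'' never produces hints, so students are not led towards it.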

====Next character hinting====
When next character hinting is available, the student has a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty applies to an attempt without hinting, the hint penalty to an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For a natural language, a ''word'', a ''number'' or a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexems. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'' or an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' shows the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white
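
The three examples above can be reproduced with a small Python sketch. It assumes a very rough scanner for "simple english" (a lexem is a run of word characters or a run of punctuation marks) and a known correct answer; the real formal languages block has its own scanners, and the names below are invented for the illustration.

```python
import re

# A lexem is a word/number or a group of punctuation marks; spaces separate lexems.
LEXEM = re.compile(r"\w+|[^\w\s]+")


def next_lexem_hint(answer, matched_length):
    """Return the answer up to the end of the lexem that follows the partial match."""
    for lexem in LEXEM.finditer(answer):
        if lexem.end() > matched_length:
            return answer[: lexem.end()]
    return answer


ANSWER = "are blue, white, red"
print(next_lexem_hint(ANSWER, len("are bl")))      # completes the current lexem
print(next_lexem_hint(ANSWER, len("are blue")))    # adds the next lexem
print(next_lexem_hint(ANSWER, len("are blue,")))   # adds the next word
```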

The Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
* '''simple english''' - the English language scanner recognizes words, numbers and punctuation marks;
* '''C/C++ language''' - the programming language C (or C++);
* '''printf language''' - a special language for formatting strings in the C/C++ programming language; you will probably have it disabled.

The administrator of the site can control which languages are available to the teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.
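
These capturing rules are shared by most regex engines, so they can be checked with a short Python sketch. The helper <code>captured</code> is an illustrative name; it returns the values of all numbered subpatterns for a full match.

```python
import re


def captured(pattern, text):
    """Return the tuple of captured subpatterns, or None if there is no match."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None


# Nested subpatterns are numbered by their opening parentheses.
print(captured(r"a(b(c))", "abc"))
# A repeated subpattern keeps only its last repetition.
print(captured(r"(\d)+", "123"))
# (?: ) groups without capturing; lookaround assertions do not capture either.
print(captured(r"(?:ab)+c", "ababc"))
print(captured(r"a(?=b)b", "ab"))
```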

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
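
The placeholder substitution described above can be sketched in Python. This is not the Preg implementation, only an illustration of the behaviour; the function name <code>fill_feedback</code> is made up, and leaving unknown placeholder numbers untouched is a choice of this sketch.

```python
import re

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"


def fill_feedback(pattern, response, feedback):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # No such subpattern: leave the placeholder as written.
            return placeholder.group(0)
        # A subpattern that did not take part in the match gives "".
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```

The second call shows the situation from the last paragraph: the text "Integral part is" stays in place while {$1} turns into an empty string.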

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that works when something is absent from the response (and gives appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!.*''' and '''.*)''', as in the example above. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
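
The same expression works in Python's <code>re</code> module, so the behaviour is easy to try out. The function name <code>word_is_missing</code> is hypothetical and the responses are invented examples.

```python
import re

# Matches (at the start of the response) only if "necessary" never occurs as a word.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """True if the response does not contain the word 'necessary'."""
    return MISSING_NECESSARY.search(response) is not None


print(word_is_missing("this step is optional"))
print(word_is_missing("this step is necessary here"))
print(word_is_missing("unnecessary work"))
```

The last response counts as missing the word, because "\b" requires "necessary" to stand as a separate word.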

You can also have a rough search for '''misplaced words''' (it will actually work only if anything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor": it matches an "am" that has no "I" before it (the "(?<!I\s)" part) and no "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; a lookbehind assertion must have a fixed length in Perl-compatible regular expressions, so it uses a single "\s". If you want to catch the first (last) word (punctuation mark, etc), place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't describe a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of regular expressions
; '''explaining graph''' : shows you how your expression will work in a graphical way
; '''description''' : formulates the meaning of your expression in English
; '''testing tool''' : allows you to enter strings and see how they match your regexes

INSTALLATION NOTE. To have the ''syntax tree'' and ''explaining graph'' tools working you need to have the [http://www.graphviz.org/ Graphviz] package installed on your server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. It is used by the authoring tools code to draw pictures for you.

===Regular expression area===
There you can enter (or edit) regular expression and refresh all the tools when done.

TODO You can also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. This will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - as any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or sets of characters that are allowed in a particular position. '''A''' is a regular expression that matches a single character 'A'. The '''operators''' in regular expressions define ways to combine individual characters into a pattern: sequence (the ''concatenation'' operator), alternative and repetition (called a ''quantifier''). Concatenation is such a simple operator that it doesn't have any character for it at all - just write some characters in sequence, and they'll be concatenated. But it still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. An alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), asterisk (zero or more times) and plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.
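
The basic operators can be tried out with a small Python sketch; Python's <code>re</code> module uses the same characters for them. The helper <code>full</code> is an illustrative name that checks whether a whole string matches a pattern.

```python
import re


def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Question mark: zero or one repeat of the preceding operand.
print(full(r"colou?r", "color"), full(r"colou?r", "colour"))
# Asterisk: zero or more repeats; plus: one or more repeats.
print(full(r"ab*", "a"), full(r"ab+", "a"))
# Vertical bar: alternative between two concatenations.
print(full(r"cat|dog", "dog"))
# Curly braces: minimum and maximum number of repeats.
print(full(r"\d{2,4}", "123"), full(r"\d{2,4}", "12345"))
```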

The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets '''[as3]''', by ranges in square brackets '''[a-z]''', by special sequences ('''\d''' means any digit, '''\W''' anything except a letter, digit and underscore, <nowiki>[[:alpha:]]</nowiki> any letter etc). An important type of operand is the ''simple assertion'': these allow you to test some conditions - start of the string '''^''', end of the string '''$''' or word boundary '''\b'''.

You can find a list and more examples of operands and operators in the [[#Regular expressions reference|reference]] section.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's see what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or just "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
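The precedence rules above can be checked with a few lines of code. This is only a sketch: it uses Python's built-in <code>re</code> module as a stand-in for the Preg engines (an assumption made for illustration; the constructs used here behave the same way in Perl-compatible dialects), and the helper name <code>full</code> is hypothetical.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern (like exact matching)."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
print(full(r"ab*", "abbb"))    # True
print(full(r"ab*", "abab"))    # False
print(full(r"(ab)*", "abab"))  # True

# Quantifier over alternative: "*" belongs to "d" only, unless the alternative is grouped.
print(full(r"ab|cd*", "cddd"))            # True
print(full(r"ab|cd*", "abab"))            # False
print(full(r"(ab|cd)*", "ababcdabcdcd"))  # True

# A repeated alternative is not an alternative of repetitions.
print(full(r"(a|b){2}", "ab"))   # True
print(full(r"a{2}|b{2}", "ab"))  # False
```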

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
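The three anchoring variants can be compared on a handful of sample responses. The sketch below uses Python's <code>re</code> module as a stand-in for the question's engines (an assumption for illustration); the list <code>responses</code> and the helper <code>matching</code> are hypothetical names.

```python
import re

responses = ["ab", "abxyz", "xyzcd", "xabx"]

def matching(pattern):
    """Return the sample responses in which the pattern finds a match."""
    return [r for r in responses if re.search(pattern, r)]

# "^" is concatenated only with "ab" and "$" only with "cd".
print(matching(r"^ab|cd$"))    # ['ab', 'abxyz', 'xyzcd']

# Both ways of anchoring every alternative accept only exact answers.
print(matching(r"^(ab|cd)$"))  # ['ab']
print(matching(r"^ab$|^cd$"))  # ['ab']

# Without anchors a match may be found anywhere inside the response.
print(matching(r"ab|cd"))      # ['ab', 'abxyz', 'xyzcd', 'xabx']
```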

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot. Loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters which could be used both inside or outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
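A few of the operands listed above, tried out in code. This is a sketch using Python's <code>re</code> module as a stand-in (an assumption for illustration). Python's <code>re</code> supports neither "\Q ... \E" nor POSIX classes nor "\p{xx}", so those are left out; <code>re.escape</code> is shown only as a rough Python-side analogue of quoting a fragment literally.

```python
import re

# Escaped special characters match themselves.
print(bool(re.fullmatch(r"a\*b\[", "a*b[")))  # True

# Ranges mixed with a single character: "t" and "G" fall outside the class.
cls = r"[a-szC-F0-9]"
print([c for c in "atzDG7" if re.fullmatch(cls, c)])  # ['a', 'z', 'D', '7']

# Negative character set; the trailing hyphen stands for itself.
print(re.findall(r"[^a-z-]", "ab-C9"))  # ['C', '9']

# Word boundaries do not consume characters: "concat" is skipped.
print(re.findall(r"\bcat\b", "cat concat cat!"))  # ['cat', 'cat']

# Python-side analogue of literal quoting (not the \Q...\E syntax itself).
quoted = re.escape("^(abc)$") + "."
print(bool(re.fullmatch(quoted, "^(abc)$x")))  # True
```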

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to denote it. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
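To see what each quantifier accepts, one can count how many repetitions of "x" pass an exact match. The sketch below uses Python's <code>re</code> module as a stand-in (an assumption for illustration); the helper <code>accepted</code> is a hypothetical name, and the "{,2}" form is left out because dialects differ in how they treat it.

```python
import re

def accepted(pattern, max_len=6):
    """Return the repeat counts n (0..max_len) for which "x" * n matches exactly."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "x" * n)]

print(accepted(r"x*"))      # [0, 1, 2, 3, 4, 5, 6]
print(accepted(r"x+"))      # [1, 2, 3, 4, 5, 6]
print(accepted(r"x?"))      # [0, 1]
print(accepted(r"x{2,4}"))  # [2, 3, 4]
print(accepted(r"x{2,}"))   # [2, 3, 4, 5, 6]
print(accepted(r"x{2}"))    # [2]
```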

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered starting from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
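Both regexes from this example can be tried directly. The sketch below runs them with Python's <code>re</code> module, used here as a stand-in for the question's engines (an assumption for illustration); the variable names <code>strict</code> and <code>loose</code> are hypothetical.

```python
import re

strict = r"int ([_\w][_\w\d]*); \1 = -?\d+;"
loose = r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"

# The backreference \1 must repeat the captured variable name.
print(bool(re.fullmatch(strict, "int _var; _var = -10;")))  # True
print(bool(re.fullmatch(strict, "int a; b = 1;")))          # False

# The looser regex tolerates extra or missing spaces.
m = re.fullmatch(loose, " int var2 ; var2=123 ; ")
print(m.group(1))  # var2

# The small "([ab])\1" example from the text.
print([s for s in ["aa", "ab", "ba", "bb"] if re.fullmatch(r"([ab])\1", s)])  # ['aa', 'bb']
```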

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
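As an illustration, here is the C declaration regex rewritten with a named subpattern. The sketch uses Python's <code>re</code> module as a stand-in (an assumption for illustration); of the syntaxes listed above Python accepts only "(?P<name>...)" and "(?P=name)", so only those two are shown, and the group name <code>var</code> is hypothetical.

```python
import re

# Same regex as before, but the variable name is captured by name, not by number.
pattern = r"int (?P<var>[_\w][_\w\d]*); (?P=var) = -?\d+;"

m = re.fullmatch(pattern, "int _var; _var = -10;")
print(m.group("var"))  # _var

# A different name on the right-hand side breaks the named backreference.
print(re.fullmatch(pattern, "int a; b = 1;"))  # None
```

Adding another pair of parentheses before the named group would shift its number, but <code>m.group("var")</code> would keep working unchanged.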

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will use the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both the 1st, "b" and "d" are both the 2nd.

===Complex assertions===
Complex assertions test some part of the string without including it in the matched text, but they affect whether a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
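The four assertions can be compared on short strings. The sketch below uses Python's <code>re</code> module as a stand-in (an assumption for illustration; remember that the NFA engine described on this page does not yet support complex assertions). Note how the negative forms can still match a shorter run of "a" through backtracking.

```python
import re

# Positive lookahead: "b" is required after the match but is not part of it.
print(re.findall(r"a+(?=b)", "aab aac"))   # ['aa']

# Negative lookahead: in "aab" the engine backs off to a single "a" followed by "a".
print(re.findall(r"a+(?!b)", "aab aac"))   # ['a', 'aa']

# Positive lookbehind: "b" is required before the match.
print(re.findall(r"(?<=b)a+", "baa caa"))  # ['aa']

# Negative lookbehind: in "baa" the match starts at the second "a".
print(re.findall(r"(?<!b)a+", "baa caa"))  # ['a', 'aa']
```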

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This software is considered a scientific project and as such needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey among your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help in publishing it in English-language journals (information and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-23T10:15:04Z

Oasychev: /* Authoring tools */ - added information about graphviz installation

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The first section should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
Regardless of the way you use the Preg question type and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any sort of feedback with freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or giving encouragement or some simple support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type there may help a lot.

===I don't (want) to know anything about regular expressions but next word(character) hinting seems useful===
You can use the Preg question type just like Shortanswer questions with advanced hinting, without any knowledge of regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you may make heavy use of the authoring tools to create your questions. The authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with changing your regexes, watch the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How question work|question working]] to better understand the various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything they can===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily; if you want to know what you are doing when writing your regexes, instead of blind trial and error, you should spend some time and effort on understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section about [[#Understanding regular expressions|understanding regular expressions]]. Then read a little about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be especially useful when you are trying to grasp the meaning of ''precedence'' and ''arity''. Once you understand the principles of regular expressions well, read the sections about [[#How question work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write your regexes without much use of the authoring tools, except the testing tool to test your regexes.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How question work|question working]] to understand the various settings and the question's behaviour under them. You may also be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, the [[#Regular expressions reference|regular expression reference]] may be of some use to you.

==How question work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you can use them in almost any combination):
* '''pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers;
* '''hinting''' - when your students are stuck on the question, you may allow them to ask for the next correct word (lexem) or character (with a penalty if you wish).

===Settings that affect how the question works===
====Case sensitivity====
You should know this setting from the core Shortanswer question type. Note, however, that you can [[#Local case-sensitivity modifiers|change case sensitivity inside your regular expressions]], making only parts of them case sensitive.

====Exact matching====
'''Matching''' means finding a part of the student's answer that suits the regular expression (or your answer). This part is called a '''match'''. Traditionally, regular expressions were used to look for matches '''inside''' strings, i.e. '''all''' of the ''regular expression'' should match, but it could match only a '''part of''' the ''student's response''.

; '''Yes''' : the entire student's response should match the regular expression.
; '''No''' : any part of the student's response could match the regular expression. You can still make some of your regexes match the whole student's response using [[#Anchoring|regular expression features]].
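The difference between the two values can be sketched with Python's <code>re</code> module, where <code>re.fullmatch</code> plays the role of exact matching set to "Yes" and <code>re.search</code> the role of "No". This mapping is an assumption made for illustration, not a description of the question's internal code.

```python
import re

regex = r"are blue, white(,| and) red"
response = "they are blue, white and red"

# Exact matching "No": a match anywhere inside the response is enough.
print(re.search(regex, response) is not None)     # True

# Exact matching "Yes": the leading "they " makes the response fail.
print(re.fullmatch(regex, response) is not None)  # False

# A response with nothing around the match passes in both modes.
print(re.fullmatch(regex, "are blue, white, red") is not None)  # True
```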

====Notations====
Notation is the way you write your regexes. Alternatively, choose the "Moodle shortanswer" notation to avoid regexes altogether and still use the hinting features.
; '''Regular expression''' : This is the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; '''Regular expression (extended)''' : This notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
; '''Moodle shortanswer''' : Choose this notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
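To give a feeling for the extended notation, here is the C declaration regex from the reference section laid out with spaces and comments. The sketch uses Python's <code>re.VERBOSE</code> flag, which follows similar whitespace and comment rules; treating it as equivalent to the Preg extended notation is an assumption made for illustration.

```python
import re

pattern = re.compile(r"""
    \s* int \s+          # type keyword
    ([_\w][_\w\d]*)      # subpattern 1: the variable name
    \s* ; \s*            # end of the declaration
    \1 \s* = \s* -?\d+   # assignment to the same variable
    \s* ; \s*            # end of the assignment
""", re.VERBOSE)

# Spaces and comments inside the pattern are ignored; \s still matches real spaces.
print(bool(pattern.fullmatch(" int var2 ; var2=123 ; ")))  # True
print(bool(pattern.fullmatch("int a; b = 1;")))            # False
```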

====Matching engines====
A matching engine is the program code that performs regular expression execution. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.

; '''PHP preg extension''' : Use it when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or you encounter bugs in them. It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; '''Non-deterministic finite state automata (NFA)''' : Use the NFA engine to '''perform hinting''' for your students if it can handle your regular expressions. The NFA engine is custom PHP code that uses finite automata to perform matching. It allows many (but not all) regular expression features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but may still contain bugs in rare cases. Features not supported for now include complex assertions, recursion and conditional subpatterns.
; '''Deterministic finite state automata (DFA)''' : WARNING: This engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. unless it rejects regexes that this engine accepts); it allows almost anything the DFA engine can, and the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and breaks off at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
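The idea of partial matching and shortest-path hinting can be sketched in a few lines. This is NOT how the NFA engine works internally: the sketch below is a brute-force illustration that assumes the regex describes a small finite set of strings, which is written out by hand in the hypothetical list <code>ACCEPTED</code>. All function names are hypothetical too.

```python
# Every string accepted by "are blue, white(,| and) red", enumerated by hand.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def partial_match(response):
    """Return (start, length) of the longest partial match inside the response."""
    best = (0, 0)
    for start in range(len(response)):
        for answer in ACCEPTED:
            n = common_prefix_len(answer, response[start:])
            if n > best[1]:
                best = (start, n)
    return best

def next_char_hint(response):
    """Next character on the shortest path to a full match, or None if nothing is left."""
    start, length = partial_match(response)
    matched = response[start:start + length]
    candidates = [a for a in ACCEPTED if a.startswith(matched) and len(a) > length]
    if not candidates:
        return None
    return min(candidates, key=len)[length]

print(partial_match("they are blue, vhite and red"))   # (5, 10)
print(next_char_hint("they are blue, vhite and red"))  # w
print(next_char_hint("are blue, white; red"))          # ,
```

In the last call both "," and " " would continue the match, but the "," path needs 5 more characters while the " and" path needs 8, so "," is hinted.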

It is possible that not all regular expressions give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those from 0% to 49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
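The selection rule amounts to a simple filter over the question's answers. In the sketch below the list <code>answers</code>, its dictionary keys and the function name are all hypothetical; they only illustrate the rule and do not mirror the question's real data structures.

```python
# Hypothetical answers of one question: a regex and its grade as a fraction.
answers = [
    {"regex": r"are blue, white(,| and) red", "grade": 1.0},
    {"regex": r"are white(,| and) red", "grade": 0.6},
]

def usable_for_hinting(answers, hint_grade_border):
    """Keep only regexes whose grade is not below the hint grade border."""
    return [a["regex"] for a in answers if a["grade"] >= hint_grade_border]

# With a border of 1 only the 100% answer can drive hints.
print(len(usable_for_hinting(answers, 1)))    # 1

# With a border of 0.5 the 60% answer is used for hinting as well.
print(len(usable_for_hinting(answers, 0.5)))  # 2
```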

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders differ between languages. For now it supports these languages (more are planned):
* '''simple english''' - the English language scanner recognizes words, numbers and punctuation marks;
* '''C/C++ language''' - the C (or C++) programming language;
* '''printf language''' - a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.
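Both rules can be tried out in any Perl-style regex engine. The sketch below uses Python's `re` module as a stand-in for the Preg engines (an assumption; the engines may differ in details).

```python
import re

# A repeated subpattern keeps only its last repetition.
m = re.fullmatch(r"(a|b)+", "abab")
print(m.group(0))  # the whole match
print(m.group(1))  # the last repetition only

# (?: ) groups without capturing, so only (c) is counted as a subpattern.
grouped = re.compile(r"(?:ab)+(c)")
print(grouped.groups)
```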

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' and '''NFA''' support full subpattern capturing. The '''DFA''' engine can't do this, so you can use only the {$0} placeholder with it.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part and the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
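The substitution can be reproduced with a short Python sketch. `fill_feedback` is a hypothetical helper written for this page, not the question type's own code, and it assumes exact matching of the whole response.

```python
import re

PLACEHOLDER = re.compile(r"\{\$(\d+)\}")


def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders with the n-th captured subpattern."""
    match = re.fullmatch(regex, response)
    if match is None:
        return None
    # A subpattern that took no part in the match becomes an empty string.
    return PLACEHOLDER.sub(
        lambda p: match.group(int(p.group(1))) or "", feedback
    )


NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```

The second call shows the limitation mentioned above: the words "Integral part is" stay in the text even though {$1} is empty.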

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that matches when something is absent from the response (and gives appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
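The missing-word pattern can be checked quickly with Python's `re` module, used here as a stand-in for the Preg engines (an assumption). `is_word_missing` is a hypothetical helper name.

```python
import re

# Matches (at the very start) only when "necessary" occurs nowhere ahead.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def is_word_missing(response):
    """Return True when the word "necessary" is absent from the response."""
    return MISSING_NECESSARY.search(response) is not None


print(is_word_missing("it is needed"))          # word is absent
print(is_word_missing("it is necessary here"))  # word is present
```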

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s+)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
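The misplaced "I" check can be tried in Python, whose `re` module writes negative lookbehind as `(?<!...)`. This is a sketch under the assumption that Python's behaviour is close enough to the Preg engines for this pattern; `has_misplaced_i` is a hypothetical name.

```python
import re

# "I" that is neither at the start of the string nor followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")


def has_misplaced_i(response):
    """Return True when the response contains a misplaced "I"."""
    return MISPLACED_I.search(response) is not None


print(has_misplaced_i("I am victor"))  # correct order
print(has_misplaced_i("am I victor"))  # "I" is out of place
```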

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of the regular expression;
; '''explaining graph''' : shows you graphically how your expression will work;
; '''description''' : formulates the meaning of your expression in English;
; '''testing tool''' : allows you to enter strings and see how they match your regexes.

INSTALLATION NOTE. For the ''syntax tree'' and ''explaining graph'' tools to work, you need to have the open source [http://www.graphviz.org/ Graphviz] package installed on your server and to fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. It is used by the authoring tools code to draw pictures for you.

===Regular expression area===
Here you can enter (or edit) the regular expression and refresh all the tools when you are done.

TODO You could also select a part of the regular expression and it will be mapped onto the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure graphically: what is inside what. It is most useful if you know how to read regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid-line rectangles show subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember a part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO A green rectangle shows the selected part of the expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there to match against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in the database if you save the regex (they will be lost if you close the window with the "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.
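Precedence and arity can be made visible by letting a parser build the expression tree. The sketch below uses Python's standard `ast` module purely as an illustration; `top_operator` is a hypothetical helper that names the operator at the root of the tree, i.e. the one evaluated last.

```python
import ast


def top_operator(expression):
    """Return the name of the operator at the root of the expression tree."""
    tree = ast.parse(expression, mode="eval").body
    return type(tree.op).__name__


print(top_operator("x+y*2"))    # '+' is the root, so '*' is evaluated first
print(top_operator("(x+y)*2"))  # parentheses make '*' the root
print(top_operator("-x"))       # the unary minus has a single operand
```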

Now any expression is just a Lego game, where you arrange a sequence of '''operators''', each with the correct number of '''operands''' (arity), and control their evaluation order using '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns, so their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' (preceded by a backslash) when used as operands.

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

TODO - write more about operands and operators, but simple.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
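The precedence rules above can be verified with a few exact matches. Python's `re` module is used here as a stand-in for the Preg engines (an assumption), and `matches` is a hypothetical helper that requires the whole string to match.

```python
import re


def matches(regex, text):
    """Return True when `regex` matches the whole of `text`."""
    return re.fullmatch(regex, text) is not None


# Quantifier over concatenation.
print(matches("ab*", "abbb"), matches("ab*", "abab"))
print(matches("(ab)*", "abab"))

# Quantifier over alternative.
print(matches("ab|cd*", "cdddddd"), matches("(ab|cd)*", "ababcdabcdcd"))

# A quantifier repeats the whole alternative, not one fixed choice.
print(matches("(a|b){2}", "ab"), matches("a{2}|b{2}", "ab"))
```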

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with '''^''', the match should start at the start of the student's response;
* if a regular expression ends with '''$''', the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
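The difference between these three patterns shows up as soon as the response contains extra text. A short Python sketch (Python's `re` is assumed to behave like the Preg engines for these simple assertions; `found` is a hypothetical helper):

```python
import re


def found(regex, text):
    """Return True when `regex` matches somewhere in `text`."""
    return re.search(regex, text) is not None


# "^ab|cd$" anchors each alternative on one side only.
print(found(r"^ab|cd$", "abXYZ"), found(r"^ab|cd$", "XYZcd"))

# Both exact-match forms reject the longer strings.
print(found(r"^(ab|cd)$", "abXYZ"), found(r"^ab$|^cd$", "XYZcd"))
print(found(r"^(ab|cd)$", "cd"), found(r"^ab$|^cd$", "ab"))
```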

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#* '''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with "^", which means a '''negative character set''': it matches any character except those from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters, so it stands for itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be the first character there.
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters but conditions to test; unlike other operands they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches where there is no word boundary, the opposite of "\b".
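Some of the operands above can be tried out with the following Python sketch. Python's `re` module is only a stand-in for the Preg engines: it has no "\Q ... \E" sequence, so `re.escape` is used as the closest Python equivalent, and the unicode property and POSIX class syntax is left out because `re` does not accept it.

```python
import re


def matches(regex, text):
    """Return True when `regex` matches the whole of `text`."""
    return re.fullmatch(regex, text) is not None


# Escaped special characters match themselves.
print(matches(r"a\*b\[", "a*b["))

# re.escape plays the role of \Q...\E: everything inside is literal.
print(matches(re.escape("^(abc)$") + ".", "^(abc)$!"))

# Ranges, a negative set and escaping inside a character set.
print(matches(r"[a-szC-F0-9]", "z"), matches(r"[a-szC-F0-9]", "t"))
print(matches(r"[^a-z-]", "A"), matches(r"[^a-z-]", "-"))
print(matches(r"[\-\]\\]", "]"), matches(r"[\-\]\\]", "\\"))

# \b is a condition between characters, not a character itself.
print(re.search(r"\bcat\b", "a cat sat") is not None)
print(re.search(r"\bcat\b", "concatenate") is not None)
```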

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
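The numbering rule and both backreference examples can be checked with Python's `re` module, used here as a stand-in for the Preg engines (an assumption). `declares_and_initializes` is a hypothetical helper name.

```python
import re

# Subpatterns are numbered by their opening parentheses.
numbered = re.fullmatch(r"a(b(c)(d))e", "abcde")
print(numbered.groups())

# A backreference must repeat exactly what its subpattern captured.
PAIR = re.compile(r"([ab])\1")
print(PAIR.fullmatch("aa") is not None, PAIR.fullmatch("ab") is not None)

# The C declaration regex with optional spaces, as given above.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)


def declares_and_initializes(text):
    """Return True when the same variable is declared and then assigned."""
    return DECLARATION.fullmatch(text) is not None


print(declares_and_initializes(" int var2 ; var2=123 ; "))
print(declares_and_initializes("int a; b = 1;"))
```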

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
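A named version of the C declaration example, sketched in Python. The sketch uses the "(?P<name>...)" and "(?P=name)" forms because those are the ones Python's `re` module accepts; the name `var` is chosen for this illustration only.

```python
import re

# The variable name is captured as "var" and must reappear unchanged.
DECLARATION = re.compile(r"int (?P<var>[_\w][_\w\d]*); (?P=var) = -?\d+;")

m = DECLARATION.fullmatch("int _var; _var = -10;")
print(m.group("var"))

# Adding another subpattern in front changes the number, not the name.
SHIFTED = re.compile(r"(unsigned )?int (?P<var>[_\w][_\w\d]*); (?P=var) = \d+;")
print(SHIFTED.fullmatch("unsigned int n; n = 5;").group("var"))
```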

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

===Complex assertions===
Assertions about some part of the string don't become part of the matched text, but affect whether a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
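The four assertions can be observed with Python's `re` module, used as a stand-in for the Preg engines (an assumption). `first_match` is a hypothetical helper that returns the matched text together with its position, which shows that the asserted characters are not part of the match.

```python
import re


def first_match(regex, text):
    """Return (matched text, start, end) of the first match, or None."""
    m = re.search(regex, text)
    return None if m is None else (m.group(), m.start(), m.end())


print(first_match(r"a+(?=b)", "caab"))    # "b" is required but not consumed
print(first_match(r"a+(?!b)", "aab"))     # backs off so that no "b" follows
print(first_match(r"(?<=b)a+", "aabaa"))  # only the "a"s after "b"
print(first_match(r"(?<!b)a+", "baa"))    # skips the "a" right after "b"
```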

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, e.g. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a non-existent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
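For comparison, here is how error messages of this kind can be collected from a regex parser in Python. This only illustrates the idea of reporting errors to the question author: it uses Python's own `re` parser, whose checks and messages are not those of the Preg parser, and `regex_error` is a hypothetical helper.

```python
import re


def regex_error(pattern):
    """Return an error message for an invalid pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# A few of the error classes listed above, taken from the examples there.
for pattern in ["abc)", "(?:qwerty", "+", "[a-fA-F\\d", "[z-a]", "ab\\", "(abc)\\2"]:
    print(repr(pattern), "->", regex_error(pattern))
```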

==The ways to give back==
This software is considered a scientific project and as such needs the help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article or help in publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, during cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-23T10:06:59Z

Oasychev: /* I don't (want) know anything about regular expressions but next word(character) hinting seems useful */

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. The first section should guide you in using this documentation; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
Regardless of how you use the Preg question type and what your capabilities are, you can always [[#The ways to give back|give back]]. It is hard to get any sort of feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about those needs, or giving some encouragement or some simple support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type there may help a lot.

===I don't (want) to know anything about regular expressions but next word(character) hinting seems useful===
You can use the Preg question type just like Shortanswer questions with advanced hinting, without any knowledge of regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#Hinting|hinting]] to understand more about the hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you may make heavy use of the authoring tools to create your questions. The authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with changing your regexes, watch the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How question work|question working]] to better understand the various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything they can===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily; if you want to know what you are doing when writing your regexes instead of blind trial and error, you should spend some time and effort on understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section about [[#Understanding regular expressions|understanding regular expressions]]. Then skim the section about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be of special use to you when you try to grasp the meaning of ''precedence'' and ''arity''. Once you understand the principles of regular expressions well, read the sections about [[#How question work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write your regexes without much use of the authoring tools, except for the testing tool.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How question work|question working]] to understand the various settings and the question's behaviour under them. You may also be interested in regex testing, described in the [[#Authoring tools|authoring tools]] section. Finally, the [[#Regular expressions reference|regular expression reference]] may be of some use to you.

==How question work==
Basically, this question type is an extended version of the Shortanswer question type. It extends its features in several different ways (you can use them in almost any combination):
* '''pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers;
* '''hinting''' - when your students are stuck on the question, you may allow them to ask for the next correct word (lexeme) or character (with a penalty if you wish).

===Settings that affect how the question works===
====Case sensitivity====
You should know this setting from the core Shortanswer question type. Note, however, that you can [[#Local case-sensitivity modifiers|change case sensitivity inside your regular expressions]], making only parts of them case sensitive.

====Exact matching====
'''Matching''' means finding a part of the student's answer that suits the regular expression (or your answer). This part is called a '''match'''. Traditionally, regular expressions were used to look for matches '''inside''' strings, i.e. '''all''' of the ''regular expression'' should match, but it could match with only a '''part of''' the ''student's response''.

; '''Yes''' : the entire student's response should match the regular expression.
; '''No''' : any part of the student's response could match the regular expression. You can still make some of your regexes match the whole student's response using [[#Anchoring|regular expression features]].
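
The difference between the two values can be tried out in any regex library. The following is a minimal sketch in Python's re module, which is used here only for illustration (Preg itself is PHP code); it assumes that re.search behaves like '''Exact matching''' set to "No" and re.fullmatch like "Yes". The pattern and the response are made-up examples.

```python
import re

pattern = r"blue|white|red"
response = "the flag is white"

# "No": the whole regex must match, but only against some part of the response.
inside = re.search(pattern, response)
# "Yes": the whole regex must match the entire response.
exact = re.fullmatch(pattern, response)

print(inside.group(0) if inside else None)
print(exact)
```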

====Notations====
Notation is the way you write your regexes. Alternatively, choose the "Moodle shortanswer" notation to avoid regexes altogether and still use the hinting features.
; '''Regular expression''' : This is the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; '''Regular expression (extended)''' : This notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not part of a character class (use \s instead), so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
; '''Moodle shortanswer''' : Choose this notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything said here about regular expressions, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
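
To get a feeling for the extended notation, here is a rough sketch using Python's re.VERBOSE flag, which plays a role similar to the 'x' modifier. This is an analogy for illustration only, not the code Preg runs, and the names "decimal" and "compact" are made up. The same regex is written once with free formatting and comments and once in compact form.

```python
import re

# Formatted version: whitespace is ignored and "#" starts a comment.
decimal = re.compile(r"""
    [+\-]?        # optional sign
    (?:[0-9]+)?   # optional integral part
    \.            # decimal point (escaped dot)
    [0-9]+        # fractional part
    """, re.VERBOSE)

# The same regex in the usual compact notation.
compact = re.compile(r"[+\-]?(?:[0-9]+)?\.[0-9]+")

for text in ["-12.5", ".75", "12."]:
    print(text, bool(decimal.fullmatch(text)), bool(compact.fullmatch(text)))
```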

====Matching engines====
A matching engine is the program code that performs regular expression matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.

; '''PHP preg extension''' : Use it when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult, or you encounter bugs in them. It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; '''Non-deterministic finite state automata (NFA)''' : Use the NFA engine to '''perform hinting''' for your students if it can handle your regular expressions. The NFA engine is custom PHP code that uses finite automata to perform matching. It allows many (but not all) regular expression features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but may still contain bugs in rare cases. Features not supported for now include complex assertions, recursion and conditional subpatterns.
; '''Deterministic finite state automata (DFA)''' : WARNING: This engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. if it does not reject regexes that the DFA engine accepts); it allows almost anything the DFA engine can, and the NFA engine is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and at which character the match breaks. Say you entered the expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
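
The idea of partial matching can be sketched in a few lines of Python. This is not how the NFA engine works internally: as a simplifying assumption, the regex "are blue, white(,| and) red" is expanded by hand into the two strings it accepts, and the hypothetical function partial_match simply looks for the longest piece of the response that is a prefix of one of them.

```python
# Strings accepted by "are blue, white(,| and) red", expanded by hand (assumption).
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def partial_match(response, accepted=ACCEPTED):
    """Return (start, length) of the longest partial match in response."""
    best = (0, 0)
    for start in range(len(response)):
        for target in accepted:
            length = 0
            # Extend the match while response and target agree.
            while (length < len(target) and start + length < len(response)
                   and response[start + length] == target[length]):
                length += 1
            if length > best[1]:
                best = (start, length)
    return best

response = "they are blue, vhite and red"
start, length = partial_match(response)
print(repr(response[start:start + length]))
```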

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to a path 3 characters longer.
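
The choice of the hint character can be illustrated with a small Python sketch. It is only a model of the rule stated above, not the engine's real algorithm: the accepted strings are expanded by hand (an assumption), and the hypothetical function next_char_hint takes the already matched part of the response and returns the next character of the shortest completion.

```python
# Completions of "are blue, white(,| and) red", expanded by hand (assumption).
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def next_char_hint(matched_prefix, accepted=ACCEPTED):
    """Pick the next character that leads to the shortest completion."""
    candidates = [s for s in accepted
                  if s.startswith(matched_prefix) and len(s) > len(matched_prefix)]
    if not candidates:
        return None  # nothing left to hint
    shortest = min(candidates, key=len)
    return shortest[len(matched_prefix)]

# The response "are blue, white; red" matches up to "are blue, white".
print(repr(next_char_hint("are blue, white")))
```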

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
The '''hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those below 50% will not. Regular expressions not used for hinting work only when they fully match the student's response.
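
The rule can be written down as a simple filter. The sketch below is in Python and only illustrates the comparison described above; the list "answers" and the function usable_for_hinting are hypothetical names, and grades are written as fractions of 1.

```python
# (regex, grade) pairs; the values are illustrative.
answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def usable_for_hinting(answers, hint_grade_border):
    """Keep only answers whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(usable_for_hinting(answers, 1.0))   # only the 100% answer
print(usable_for_hinting(answers, 0.5))   # both answers
```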

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button. Pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexeme''' means an atomic part of a language. For a natural language, a ''word'', a ''number'' or a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'' or an ''operator''. Note that spaces are usually not considered to be lexemes, but separators between them, since they don't have any particular meaning.

The '''next lexeme hint''' will show the student either the completion of the current lexeme (if the partial match ends inside it) or the next one (if the student has just completed the current lexeme). For example:
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type supports next lexeme hinting using the ''formal languages block''. You should choose the language in which you expect the response to your question, since lexeme borders differ between languages. For now these languages are supported (but there will be more):
* '''simple english''' - the English language scanner recognizes words, numbers and punctuation marks;
* '''C/C++ language''' - a programming language C (or C++);
* '''printf language''' - a special language for formatting strings in the C/C++ programming language; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexeme" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching, the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening parentheses. More precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern's value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' and '''NFA''' support full subpattern capturing. The '''DFA''' engine can't do this, so you can use only the {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and error-prone.
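
The placeholder substitution described above can be imitated with Python's re module. This is a sketch for illustration, not Preg's own PHP code; the function fill_feedback is a hypothetical helper, and treating an unmatched optional subpattern as empty text is an assumption that mirrors the behaviour described for {$1}.

```python
import re

def fill_feedback(feedback, match):
    """Replace {$N} placeholders with the text captured by subpattern N."""
    def substitute(placeholder):
        number = int(placeholder.group(1))
        # An unmatched optional subpattern yields None; show it as empty text.
        return match.group(number) or ""
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

regex = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(feedback, re.fullmatch(regex, "123.34")))
print(fill_feedback(feedback, re.fullmatch(regex, ".5")))
```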

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that works when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of the current point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!.*''' and ''')''', as in the example above. Don't try the '--' syntax; it is specific to Joseph Rezeau's REGEXP question type!
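
The behaviour of this expression is easy to check outside Moodle. The sketch below uses Python's re module purely for illustration (the lookahead syntax is the same there); the two responses are made-up examples. The regex matches only when the word is absent.

```python
import re

missing_necessary = re.compile(r"^(?!.*\bnecessary\b.*)")

# The word is present, so the "missing word" regex must not match.
print(bool(missing_necessary.search("it is necessary to test")))
# The word is absent, so the regex matches (at the start of the response).
print(bool(missing_necessary.search("it is needed to test")))
```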

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part, a negative lookbehind) and doesn't have "victor" after it (the "(?!\s+victor)" part, a negative lookahead). "\s+" allows any number of spaces between words; inside the lookbehind a single "\s" is used because PCRE requires lookbehind assertions to have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), place the simple assertion for the start (end) of the string ("^" or "$") instead of a word in the related assertion. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
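
A quick way to see what such expressions catch is to run them on a few word orders. The sketch below uses Python's re module for illustration only. It writes the negative lookbehind as "(?<!...)" (the spelling listed in the reference section) and keeps its body to a fixed length, because Python's re, like PCRE, rejects variable-length lookbehind; the variable names are made up.

```python
import re

misplaced_am = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
misplaced_i = re.compile(r"(?<!^)\bI\b(?!\s+am)")

for response in ["I am victor", "am I victor", "victor I am"]:
    print(response,
          bool(misplaced_am.search(response)),
          bool(misplaced_i.search(response)))
```

Note that the last response is not caught by either expression, because each word still has its expected neighbour on one side; this is the "rough" part of the search.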

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want hints generated for these answers, as they don't describe a correct response, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. The authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of the regular expression;
; '''explaining graph''' : shows you graphically how your expression will work;
; '''description''' : formulates the meaning of your expression in English;
; '''testing tool''' : allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools when done.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
A regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure graphically: what is inside what. This is most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers what part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all learned to master arithmetic expressions since childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash.

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

TODO - write more about operands and operators, but simple.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
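
The precedence rules above can be verified with a short script. The sketch below uses Python's re module for illustration only (these basic operators behave the same way there); the helper function accepts is a made-up name and checks whether a regex matches a whole string.

```python
import re

def accepts(regex, text):
    """True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation.
print(accepts(r"ab*", "abbb"), accepts(r"ab*", "abab"))      # True False
print(accepts(r"(ab)*", "abab"))                              # True
# Quantifier over alternative.
print(accepts(r"ab|cd*", "cddd"), accepts(r"ab|cd*", "abab"))  # True False
print(accepts(r"(ab|cd)*", "ababcdabcdcd"))                   # True
# A repeated alternative versus an alternative of repetitions.
print(accepts(r"(a|b){2}", "ab"), accepts(r"a{2}|b{2}", "ab"))  # True False
```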

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get an exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
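
The anchoring pitfall is easy to reproduce. The sketch below uses Python's re module for illustration only; re.search stands in for unanchored matching ('''exact matching''' set to "no"), and the names "loose" and "strict" are made up.

```python
import re

loose = re.compile(r"^ab|cd$")      # "ab" at the start OR "cd" at the end
strict = re.compile(r"^(ab|cd)$")   # exactly "ab" or exactly "cd"

for response in ["ab", "abxyz", "xyzcd", "xabx"]:
    print(response, bool(loose.search(response)), bool(strict.search(response)))
```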

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except those from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there.
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches when not on a word boundary, the negation of "\b".
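
A few of these operands can be tried out in a short script. The sketch below uses Python's re module for illustration only, and some details differ from PCRE: Python has no "\Q ... \E" sequence, so re.escape() is used here as the closest equivalent (an assumption made for this example), and Unicode properties and POSIX classes are left out because Python's re does not support them.

```python
import re

# Character class with ranges "a-s", "C-F", "0-9" and the single character "z".
cls = re.compile(r"[a-szC-F0-9]")
print([ch for ch in "asztCG5" if cls.fullmatch(ch)])

# Literal text followed by any character (what "\Q^(abc)$\E." does in PCRE).
literal = re.escape("^(abc)$") + "."
print(bool(re.fullmatch(literal, "^(abc)$!")))

# "\b" is a word boundary: find "cat" only as a whole word.
print(re.findall(r"\bcat\b", "cat concat cat."))
```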

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
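#* "x{,2}" matches "x" from 0 to 2 times (the PCRE library itself has traditionally treated "{,2}" as literal text, so "x{0,2}" is the more portable spelling);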
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
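
The repeat counts allowed by each quantifier can be listed mechanically. The sketch below uses Python's re module for illustration only (the quantifiers shown behave the same way as in Perl-compatible regexes); the dictionary name "counts" is made up.

```python
import re

# For each quantifier, list which repeat counts of "x" (0..5) are accepted.
counts = {}
for regex in [r"x*", r"x+", r"x?", r"x{2,4}", r"x{2,}", r"x{2}"]:
    counts[regex] = [n for n in range(6) if re.fullmatch(regex, "x" * n)]
    print(regex, counts[regex])
```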

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
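
Both examples can be checked with a few lines of code. The sketch below uses Python's re module for illustration only (numbered backreferences use the same "\1" syntax there); the variable names "doubled" and "decl" are made up.

```python
import re

doubled = re.compile(r"([ab])\1")
print([s for s in ["aa", "bb", "ab", "ba"] if doubled.fullmatch(s)])

decl = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")
match = decl.fullmatch(" int var2 ; var2=123 ; ")
print(match.group(1))                         # the captured variable name
print(decl.fullmatch("int a; b = 1;"))        # different names: no match
```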

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns: the names stay the same.
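
A minimal sketch of why names are more robust than numbers. It assumes Python's <code>re</code> module, which accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above; the group names <code>var</code> and <code>value</code> are made up for this example.

```python
import re

# (?P<var>...) names the subpattern, (?P=var) refers back to it by name
decl = re.compile(
    r"int\s+(?P<var>[_a-zA-Z]\w*)\s*;\s*(?P=var)\s*=\s*(?P<value>-?\d+)\s*;"
)
m = decl.fullmatch("int _var; _var = -10;")
print(m.group("var"), m.group("value"))

# Adding a new subpattern in front shifts the numbers, but the names still work
decl2 = re.compile(r"(unsigned\s+)?" + decl.pattern)
m2 = decl2.fullmatch("unsigned int n; n = 5;")
print(m2.group("var"), m2.group("value"))
```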

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will use the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both the 1st, and "b" and "d" are both the 2nd.

===Complex assertions===
Assertions check some part of the string without including it in the matched text, but they affect where a match can occur:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
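
The four assertions can be compared side by side. This is a sketch using Python's <code>re</code> module as an assumed stand-in for a Perl-compatible engine; <code>first_match</code> is a hypothetical helper that returns the text of the first match found inside the string.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match inside the string, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Positive lookahead: "b" must follow, but it is not part of the match
print(first_match(r"a+(?=b)", "aaab"))
# Negative lookahead: the engine gives back one "a" so that "b" does not follow
print(first_match(r"a+(?!b)", "aaab"))
# Positive lookbehind: only the "a"s right after "b" qualify
print(first_match(r"(?<=b)a+", "abaa"))
# Negative lookbehind: the "a" right after "b" cannot start the match
print(first_match(r"(?<!b)a+", "baa"))
```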

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
These modifiers override the general case sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and add a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
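
A rough illustration of a locally case-insensitive part. Note the assumption: recent versions of Python's <code>re</code> module do not accept a bare "(?i)" in the middle of a pattern, so this sketch uses the scoped form "(?i:...)" as the nearest equivalent of the first example above.

```python
import re

# Case sensitivity is on by default; (?i:...) switches it off for "gh" only
partly = re.compile(r"abc(de(?i:gh))xyz")

print(bool(partly.fullmatch("abcdeGHxyz")))  # only the "gh" part differs in case
print(bool(partly.fullmatch("ABCdeghxyz")))  # "abc" is still case-sensitive
```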

===Error reporting===
The native PHP preg extension functions only report whether a regular expression contains an error, so the '''PHP preg extension''' engine can't tell you much about it.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a non-existent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This software is considered a scientific project and as such needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey of your teachers and/or students about it - but it can also include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help in publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-22T21:32:28Z

Oasychev: /* Authoring tools */ - minor reformatting

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The first section should guide you in using these docs; please use it at your discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
Regardless of how you use the Preg question type and of your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any sort of feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or giving the authors some encouragement or some simple support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type there may help a lot.

===I don't (want) know anything about regular expressions but next word(character) hinting seems useful===
You can use the Preg question type just like Shortanswer questions with advanced hinting, without any knowledge of regular expressions. To do this, you need to choose:
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#How question work|how the question works]] to understand more about the hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you can make heavy use of the authoring tools to create your questions. The authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with changing your regexes, see the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How question work|how the question works]] to better understand the various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything I they could===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily, and if you want to know what you are doing when writing your regexes instead of relying on blind trial and error, you should spend some time and effort on understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section about [[#Understanding regular expressions|understanding regular expressions]]. Then read briefly about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be of special use to you when you try to get the meaning of ''precedence'' and ''arity'' right. Once you understand the principles of regular expressions well, read the sections about [[#How question work|how the question works]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write your regexes without much use of the authoring tools, except the testing tool to test your regexes.

===I know regular expressions well enought to write them on my own without further guidance===
You should read about [[#How question work|how the question works]] to understand the various settings and the question's behaviour under them. You may also be interested in regex testing, described in the [[#Authoring tools|authoring tools]] section. Finally, the [[#Regular expressions reference|regular expression reference]] may be of some use to you.

==How question work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you can use them in almost any combination):
* '''pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers;
* '''hinting''' - when your students are stuck on the question, you may allow them to ask for the next correct word (lexem) or character (with a penalty if you wish).

===Settings, that affects how question will work===
====Case sensitivity====
You should know this setting from the core Shortanswer question type. Note, however, that you can [[#Local case-sensitivity modifiers|change case sensitivity inside your regular expressions]], making only parts of them case sensitive.

====Exact matching====
'''Matching''' means finding a part of the student's answer that suits the regular expression (or your answer). This part is called the '''match'''. Traditionally, regular expressions were used to look for matches '''inside''' strings, i.e. '''all''' of the ''regular expression'' should match, but it could match a '''part of''' the ''student's response''.

; '''Yes''' : the entire student's response should match the regular expression.
; '''No''' : any part of the student's response could match the regular expression. You can still make some of your regexes match the whole student's response using [[#Anchoring|regular expression features]].
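
The difference between the two settings can be sketched with Python's <code>re</code> module (an assumed stand-in for the Preg engines): <code>re.fullmatch</code> behaves like exact matching set to "Yes", <code>re.search</code> like "No", and explicit anchors restore the exact behaviour for a single regex.

```python
import re

answer = r"are blue, white(,| and) red"
response = "they are blue, white and red"

# Exact matching "Yes": the entire response has to match
exact = re.fullmatch(answer, response)
# Exact matching "No": a match anywhere inside the response is enough
inside = re.search(answer, response)
# Anchors make an unanchored search behave like exact matching again
anchored = re.search(r"^(?:" + answer + r")$", response)

print(exact, inside.group(0), anchored)
```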

====Notations====
Notation is the way you write your regexes. Alternatively, choose the "Moodle shortanswer" notation to avoid regexes altogether and still use the hinting features.
; '''Regular expression''' : This is the usual notation for regular expressions. Precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; '''Regular expression (extended)''' : This notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not part of a character class (use \s instead), so that you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
; '''Moodle shortanswer''' : Choose this notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything that is said here on the topic of regular expressions, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
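
To give an idea of what the extended notation buys you, here is a sketch using Python's <code>re.VERBOSE</code> flag, which is assumed here to behave like the 'x' modifier for the purposes of this example (whitespace ignored, # starts a comment). The decimal-number regex is written once in the compact form and once in the extended form.

```python
import re

compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

# Same regex in extended form: free layout plus a comment on every line
extended = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
    """, re.VERBOSE)

print(compact.fullmatch("-12.5").groups(), extended.fullmatch("-12.5").groups())

# An unescaped space is ignored; an escaped one still has to be matched
print(bool(re.fullmatch(r"a b", "ab", re.VERBOSE)),
      bool(re.fullmatch(r"a\ b", "a b", re.VERBOSE)))
```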

====Matching engines====
A matching engine is the program code that performs regular expression execution. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.

; '''PHP preg extension''' : Use it when you '''don't need hinting''' and '''other engines reject your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; '''Non-deterministic finite state automata (NFA)''' : Use the NFA engine to '''perform hinting''' for your students if it can handle your regular expressions. The NFA engine is custom PHP code that uses finite automata to perform matching. It allows many (but not all) regular expression features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which is quite a lot), but may still contain bugs in rare cases. Features not supported for now include complex assertions, recursion and conditional subpatterns.
; '''Deterministic finite state automata (DFA)''' : WARNING: This engine has lacked support in the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject regexes that this engine accepts): it allows almost anything the DFA engine can, and it is much more tested and stable.

===Hinting===
The NFA and DFA matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks off at some character. Say you entered the expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is not usually desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
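
The choice of the hinted character can be illustrated with a toy model. This is only a sketch under strong assumptions, not how the NFA engine works: the regular expression is replaced by a hand-written list of the two complete answers it describes, and the function names are hypothetical. The code looks for the longest partial match anywhere in the response and, among equally long ones, prefers the answer that needs the fewest characters to complete.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_char_hint(response, targets):
    """Return (partial match, hinted character), or None on a full match."""
    best = None
    for start in range(len(response) + 1):
        tail = response[start:]
        for target in targets:
            matched = common_prefix_len(tail, target)
            remaining = len(target) - matched
            # Prefer the longest partial match, then the shortest completion
            key = (matched, -remaining)
            if best is None or key > best[0]:
                best = (key, target, matched)
    _, target, matched = best
    if matched == len(target):
        return None
    return target[:matched], target[matched]

# Hand-written expansion of "are blue, white(,| and) red" (an assumption)
targets = ["are blue, white, red", "are blue, white and red"]
print(next_char_hint("they are blue, vhite and red", targets))
print(next_char_hint("are blue, white; red", targets))
```

For the second response both answers share the partial match "are blue, white", and the ',' path wins because it needs 5 more characters while the " and" path needs 8.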

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead a student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
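
The rule is a simple filter over the answers. The sketch below is an illustration only: the list of (regex, grade) pairs and the function name <code>hinting_answers</code> are made up for this example and do not reflect the actual question type code.

```python
# Each answer is a (regular expression, grade fraction) pair
answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def hinting_answers(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(hinting_answers(answers, 1.0))  # only the 100% answer is used for hints
print(hinting_answers(answers, 0.5))  # the 60% answer is used as well
```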

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For a natural language, a ''word'', a ''number'' or a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexems. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
* '''simple english''' - the English language scanner recognizes words, numbers and punctuation marks;
* '''C/C++ language''' - the C (or C++) programming language;
* '''printf language''' - a special language for format strings in the C/C++ programming language; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' and '''NFA''' support full subpattern capturing. The '''DFA''' engine can't do this, so you can use only the {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
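
The placeholder substitution described above can be modelled in a few lines. This is a sketch in Python, not the question type's own code: the function name <code>fill_feedback</code> is hypothetical, and leaving a placeholder untouched when its number exceeds the number of subpatterns is an assumption of this example.

```python
import re

def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders in the feedback by the captured subpatterns."""
    m = re.search(regex, response)
    if m is None:
        return None

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > m.re.groups:
            return placeholder.group(0)  # no such subpattern: leave as is
        return m.group(number) or ""     # unmatched subpattern: empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

regex = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(regex, "123.34", feedback))
print(fill_feedback(regex, ".5", feedback))
```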

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regex for the missing part with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
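
A small check of the missing-word idea, sketched with Python's <code>re</code> module as an assumed stand-in for a Perl-compatible engine. It also shows why the "^" anchor matters: without it, the search finds a position after the start of "necessary" from which the word is no longer ahead.

```python
import re

missing = re.compile(r"^(?!.*\bnecessary\b.*)")

def lacks_word(response):
    """True when the word "necessary" does not occur in the response."""
    return missing.search(response) is not None

print(lacks_word("It is necessary to sleep"))    # the word is present
print(lacks_word("It is optional to sleep"))     # the word is missing
print(lacks_word("It is unnecessary to sleep"))  # \b rejects "unnecessary"

# Without the anchor the regex matches even though the word is present
unanchored = re.search(r"(?!.*\bnecessary\b.*)", "It is necessary to sleep")
print(unanchored is not None)
```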

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). In the lookahead, "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because Perl-compatible lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by start of the string and not followed by "am".
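
The misplaced-word regexes can be tried out as follows. The sketch assumes Python's <code>re</code> module and writes the negative lookbehind in its standard form "(?<!...)" with a single "\s", because lookbehind assertions in Python (as in Perl-compatible engines) must have a fixed width.

```python
import re

misplaced_am = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
misplaced_i = re.compile(r"(?<!^)\bI\b(?!\s+am)")

for response in ["I am victor", "victor am I", "am I victor"]:
    print(response,
          bool(misplaced_am.search(response)),
          bool(misplaced_i.search(response)))
```

Only the correctly ordered sentence passes both checks; in the other two, "am" lacks "I" in front of it and "I" is neither at the start nor followed by "am".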

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
; '''syntax tree''' : shows you the inner structure of the regular expression;
; '''explaining graph''' : shows you graphically how your expression will work;
; '''description''' : formulates the meaning of your expression in English;
; '''testing tool''' : allows you to enter strings and see how they match your regexes.

===Regular expression area===
Here you can enter (or edit) a regular expression and refresh all the tools when you are done.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
A regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure graphically: what is inside what. It will be most useful if you know how to understand regular expressions or are [[#Understanding regular expressions|learning to do this]].

If you don't understand the concepts of operators and precedence well, the tree may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we can find that there is a definite '''order of evaluation''', governed by the operators' '''precedence'''. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands, i.e. preceded by a backslash.

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

TODO - write more about operands and operators, but simple.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
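
The precedence rules above can be checked with a short script. This is only a sketch in Python, whose <code>re</code> module follows the same precedence rules for these simple patterns; it is not the Preg question type itself, and the helper <code>full</code> is a name invented for this illustration.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Quantifier binds tighter than concatenation.
ab_star = full(r"ab*", "abbb")        # "a" followed by several "b"
ab_star_twice = full(r"ab*", "abab")  # "ab*" does not repeat "ab"
group_star = full(r"(ab)*", "abab")   # parentheses repeat the whole "ab"

# Concatenation binds tighter than alternative.
alt_cd = full(r"ab|cd*", "cddd")
alt_mixed = full(r"ab|cd*", "abcd")
group_mixed = full(r"(ab|cd)*", "ababcdabcdcd")

# A quantifier repeats the whole alternative, not one chosen branch.
pairs = ("aa", "ab", "ba", "bb")
either_twice = [s for s in pairs if full(r"(a|b){2}", s)]
same_twice = [s for s in pairs if full(r"a{2}|b{2}", s)]

print(ab_star, ab_star_twice, group_star)
print(alt_cd, alt_mixed, group_mixed)
print(either_twice, same_twice)
```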

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative. This makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
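
The difference between these three patterns is easy to see when they are run against a few responses. The following is a sketch in Python (not the Preg engine); the sample responses are made up for the illustration.

```python
import re

loose = r"^ab|cd$"         # "ab" at the start OR "cd" at the end
strict = r"^(ab|cd)$"      # exactly "ab" or exactly "cd"
strict_alt = r"^ab$|^cd$"  # same result, every alternative anchored

responses = ["ab", "cd", "abxyz", "xyzcd", "xabx"]

loose_hits = [r for r in responses if re.search(loose, r)]
strict_hits = [r for r in responses if re.search(strict, r)]
strict_alt_hits = [r for r in responses if re.search(strict_alt, r)]

print(loose_hits)
print(strict_hits)
print(strict_alt_hits)
```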

If you set the '''exact matching''' option to "Yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
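
A few of these operands can be tried out as follows. This is a sketch in Python, not the Preg engine: Python's <code>re</code> module does not support "\Q ... \E", "\p{xx}" or POSIX classes, so only escaping, character classes and "\b" are shown, and <code>re.escape()</code> is used here as a stand-in for the "\Q ... \E" idea.

```python
import re

# Escaped special characters match themselves.
escaped = re.fullmatch(r"a\*b\[", "a*b[") is not None
backslash = re.fullmatch(r"\\", "\\") is not None

# re.escape() adds the backslashes for a literal fragment
# (a stand-in for \Q...\E, which Python does not have).
literal = "^(abc)$"
escaped_literal = re.fullmatch(re.escape(literal) + ".", "^(abc)$!") is not None

# Character classes: ranges, a single character, and a negated set.
in_class = [c for c in "atzD5" if re.fullmatch(r"[a-szC-F0-9]", c)]
not_in_class = [c for c in "q-Q7" if re.fullmatch(r"[^a-z-]", c)]

# \b is an assertion: it tests a position and consumes no characters.
whole_word = re.findall(r"\bcat\b", "cat concat cat's")

print(escaped, backslash, escaped_literal)
print(in_class, not_in_class, whole_word)
```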

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be written. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
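
The quantifier examples above can be verified against runs of "x" of different lengths. This is a sketch in Python, whose <code>re</code> module treats these particular quantifiers the same way; the helper <code>accepted</code> is a name invented for this illustration.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches completely (illustrative helper)."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

runs = ["", "x", "xx", "xxx", "xxxx", "xxxxx"]

star = accepted(r"x*", runs)
plus = accepted(r"x+", runs)
optional = accepted(r"x?", runs)
two_to_four = accepted(r"x{2,4}", runs)
at_least_two = accepted(r"x{2,}", runs)
exactly_two = accepted(r"x{2}", runs)

# An empty alternative makes the leading part optional.
empty_alt = accepted(r"(aa|bb|)c", ["aac", "bbc", "c", "abc"])

print(star, plus, optional)
print(two_to_four, at_least_two, exactly_two)
print(empty_alt)
```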

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
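
The backreference examples can be run as follows. This is a sketch in Python, which handles numbered backreferences the same way for these patterns; the name <code>DECLARATION</code> and the sample strings other than " int var2 ; var2=123 ; " are made up for the illustration.

```python
import re

# The "more correct" regex from the text, unchanged.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

# \1 must repeat exactly what subpattern 1 captured.
same_name = DECLARATION.fullmatch(" int var2 ; var2=123 ; ")
other_name = DECLARATION.fullmatch("int a; b = 1;")
variable = same_name.group(1)

# "([ab])\1" accepts only a doubled character.
doubled = [s for s in ("aa", "bb", "ab", "ba") if re.fullmatch(r"([ab])\1", s)]

print(variable, other_name, doubled)
```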

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
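
The benefit of names can be seen in a small experiment. This is a sketch in Python, which supports only the "(?P<name>...)" and "(?P=name)" forms from the list above; the group name <code>letter</code> and both patterns are made up for the illustration.

```python
import re

# The same named subpattern before and after another group is added in front.
before = re.compile(r"(?P<letter>[ab])(?P=letter)")
after = re.compile(r"(x)?(?P<letter>[ab])(?P=letter)")

m_before = before.fullmatch("bb")
m_after = after.fullmatch("xbb")

# The number of the subpattern shifts, but the name still works.
number_before = before.groupindex["letter"]
number_after = after.groupindex["letter"]

print(m_before.group("letter"), m_after.group("letter"))
print(number_before, number_after)
```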

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both 1st, "b" and "d" are both 2nd.

===Complex assertions===
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
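
The four assertions can be tried on short strings. This is a sketch in Python, which supports the same lookaround syntax; the sample strings are made up for the illustration. Note the third result: with a negative lookahead the engine may backtrack to a shorter run of "a" that satisfies the assertion.

```python
import re

# Positive lookahead: "b" is required after the match but not included in it.
ahead = re.search(r"a+(?=b)", "caab").group()

# Negative lookahead: the run of "a" must not be followed by "b".
not_ahead = re.search(r"a+(?!b)", "caac").group()

# Backtracking: in "caab" the full run "aa" is followed by "b", so the
# engine gives back one "a"; the single "a" is followed by "a", not "b".
not_ahead_backtracked = re.search(r"a+(?!b)", "caab").group()

# Lookbehind: the condition is tested on the text before the match.
behind = re.search(r"(?<=b)a+", "aa baa")
not_behind = re.search(r"(?<!b)a+", "ba caa")

print(ahead, not_ahead, not_ahead_backtracked)
print(behind.group(), behind.start(), not_behind.group(), not_behind.start())
```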

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen on the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This software is considered a scientific project, and as such it needs the kinds of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help in publishing it in English-language journals (information and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could refer to would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-22T21:28:40Z

Oasychev: /* How question work */ - rewritten section on question working

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The first section should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
Regardless of the way you use the Preg question type and of your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any sort of feedback with freely distributed software. But you shouldn't expect to get software that suits your needs ideally without telling anyone about these needs, or giving some encouragement or easy support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type there may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
You can use the Preg question type just like Shortanswer questions with advanced hinting, without any knowledge of regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#How question work|question working]] to understand more about the hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you may make heavy use of the authoring tools to create your questions. The authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with changing your regexes, watch the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How question work|question working]] to better understand the various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything they could===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily; if you want to know what you are doing when writing your regexes, instead of blind trial and error, you should spend some time and effort on understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section about [[#Understanding regular expressions|understanding regular expressions]]. Then read a little about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be of special use to you when you are trying to get the right meaning of ''precedence'' and ''arity''. Once you understand the principles of regular expressions well, read the sections about [[#How question work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write your regexes without much use of the authoring tools, except the testing tool to test your regexes.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How question work|question working]] to understand the various settings and the question behaviour under them. You may also be interested in regex testing, described in the [[#Authoring tools|authoring tools]] section. Finally, the [[#Regular expressions reference|regular expression reference]] may be of some use to you.

==How question work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you can use them in almost any combination):
* '''pattern matching''' - using regular expressions you can create powerful patterns describing possible student answers;
* '''hinting''' - when your students are stuck on the question, you may allow them to ask it for the next correct word (lexem) or character (with a penalty if you wish).

===Settings that affect how the question will work===
====Case sensitivity====
You should know this setting from the core Shortanswer question type. Note, however, that you can [[#Local case-sensitivity modifiers|change case sensitivity inside your regular expressions]], making only parts of them case sensitive.

====Exact matching====
'''Matching''' means finding a part of the student's answer that suits the regular expression (or your answer). This part is called a '''match'''. Traditionally, regular expressions were used to look for matches '''inside''' strings, i.e. '''all''' of the ''regular expression'' should match, but it could match with a '''part of''' the ''student's response''.

; '''Yes''' : the entire student's response should match the regular expression.
; '''No''' : any part of the student's response could match the regular expression. You can still make some of your regexes match the whole student's response using [[#Anchoring|regular expression features]].
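
The difference between the two values can be imitated with a generic regex library. This is only a sketch in Python, not the code of the question type; the regex and the response are made up for the illustration, and wrapping the regex in "^(?:...)$" is one assumed way of anchoring it by hand.

```python
import re

regex = r"blue|white|red"
response = "dark blue"

# Like "Yes": the entire response has to match.
exact = re.fullmatch(regex, response) is not None

# Like "No": a match anywhere inside the response is enough.
anywhere = re.search(regex, response) is not None

# Manual anchoring gives the exact behaviour for a single regex.
manual = re.search(r"^(?:" + regex + r")$", response) is not None

print(exact, anywhere, manual)
```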

====Notations====
Notation is the way you write your regexes. Alternatively, choose the "Moodle shortanswer" notation to avoid regexes altogether and still use the hinting features.
; '''Regular expression''' : This is the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for better readability - line breaks will be ignored.
; '''Regular expression (extended)''' : This notation is intended for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not part of a character class (use \s instead), so that you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a # character (unescaped and not part of a character class) to the end of that line is treated as a comment.
; '''Moodle shortanswer''' : Choose this notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything said here on the topic of regular expressions, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure the hinting behaviour of your question.
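
To give a feeling for the extended notation, here is the same regex written compactly and in a free-form, commented layout. This is a sketch in Python, whose <code>re.VERBOSE</code> flag is used as a stand-in for the PHP 'x' modifier; the Preg notation itself may differ in details, and the sample strings are made up for the illustration.

```python
import re

compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

# Whitespace and "#" comments in the pattern are ignored in verbose mode.
extended = re.compile(
    r"""
    [+\-]?       # optional sign
    ([0-9]+)?    # optional integral part
    \.           # decimal point
    ([0-9]+)     # fractional part
    """,
    re.VERBOSE,
)

samples = ["123.34", "-.5", "12", "1 .5"]
compact_hits = [s for s in samples if compact.fullmatch(s)]
extended_hits = [s for s in samples if extended.fullmatch(s)]

print(compact_hits)
print(extended_hits)
```

Spaces are ignored only in the pattern: the response "1 .5" is still rejected by both versions.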

====Matching engines====
A matching engine is the program code that performs regular expression execution. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions it should handle. The engines have different degrees of stability and offer different features.

; '''PHP preg extension''' : Use it when you '''don't need hinting''' and '''other engines are rejecting your expressions''' as too difficult or you encounter bugs in them. It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
; '''Non-deterministic finite state automata (NFA)''' : Use the NFA engine to '''perform hinting''' for your students if it can handle your regular expressions. The NFA engine is custom PHP code that uses finite automata to perform matching. It supports many (but not all) regular expression features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which is quite a lot), but may still contain bugs in rare cases. Features not supported for now include complex assertions, recursion and conditional subpatterns.
; '''Deterministic finite state automata (DFA)''' : WARNING: This engine has not been maintained during the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject regexes that this engine accepts); it supports almost everything the DFA engine does, and it is much better tested and more stable.

===Hinting===
NFA and DFA matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
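
The idea of partial matching can be illustrated with a deliberately naive sketch in Python. It is not how the NFA or DFA engines work: instead of running an automaton, it assumes the regex has been expanded by hand into the list of strings it accepts (<code>ACCEPTED</code>), and the function name <code>longest_partial_match</code> is invented for this illustration.

```python
def longest_partial_match(response, accepted):
    """Return (start, length) of the longest piece of the response
    that is a prefix of some accepted answer."""
    best = (0, 0)
    for start in range(len(response)):
        for answer in accepted:
            length = 0
            while (start + length < len(response)
                   and length < len(answer)
                   and response[start + length] == answer[length]):
                length += 1
            if length > best[1]:
                best = (start, length)
    return best

# The two strings described by "are blue, white(,| and) red".
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

response = "they are blue, vhite and red"
start, length = longest_partial_match(response, ACCEPTED)
matched = response[start:start + length]

print(repr(matched))
```

The match breaks at the "v" of "vhite", so everything from "are" up to and including the space after the comma is reported as the correct part.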

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to repeatedly press '''hint''', which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
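
The choice of the hint character can be sketched in the same naive way. This Python sketch is not the engine's algorithm: it again assumes the regex is expanded by hand into the strings it accepts, and the function name <code>next_character_hint</code> is invented for this illustration.

```python
# The two strings described by "are blue, white(,| and) red".
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def next_character_hint(matched_prefix, accepted):
    """Pick the next character on the shortest way to a complete answer."""
    completions = [
        answer[len(matched_prefix):]
        for answer in accepted
        if answer.startswith(matched_prefix) and len(answer) > len(matched_prefix)
    ]
    if not completions:
        return None
    # The shortest remaining tail decides which character is hinted.
    return min(completions, key=len)[0]

# "are blue, white; red" matches partially up to "are blue, white".
hint = next_character_hint("are blue, white", ACCEPTED)
tails = sorted(len(a) - len("are blue, white") for a in ACCEPTED)

print(repr(hint), tails)
```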

Not all regular expressions have to give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match with the student's response.
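
The rule can be written down in a few lines. This is a sketch in Python, not the question type's code; the function name <code>answers_used_for_hinting</code> and the representation of answers as (regex, grade) pairs are assumptions made for the illustration.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

# (regex, grade) pairs from the example above; grades are fractions of 1.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

strict = answers_used_for_hinting(ANSWERS, 1.0)
relaxed = answers_used_for_hinting(ANSWERS, 0.5)

print(strict)
print(relaxed)
```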

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
* '''simple english''' - the English language scanner recognizes words, numbers and punctuation marks;
* '''C/C++ language''' - the programming language C (or C++);
* '''printf language''' - a special language for formatting strings in the C/C++ programming language; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the "Formal languages" block in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last of its repeated matches is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' and '''NFA''' engines support full subpattern capturing. The '''DFA''' engine can't do this, so you can use only the {$0} placeholder with it.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you write the feedback
The number is: {$0} Integral part is {$1} and fractional part is {$2}
and a student enters
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case: the placeholder syntax would become complex and prone to errors.
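
The placeholder substitution can be imitated in a few lines of Python. This is only a sketch using Python's <code>re</code> module as a stand-in for the Preg engines; the function name <code>fill_feedback</code> is hypothetical and the real question type may process placeholders differently.

```python
import re

# Matches placeholders of the form {$0}, {$1}, {$2}, ...
PLACEHOLDER = re.compile(r"\{\$(\d+)\}")

def fill_feedback(regex, response, feedback):
    """Replace {$n} in the feedback by the n-th captured subpattern."""
    match = re.fullmatch(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        captured = match.group(int(placeholder.group(1)))
        # A subpattern that took no part in the match yields an empty string.
        return captured if captured is not None else ""

    return PLACEHOLDER.sub(substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

with_integral = fill_feedback(NUMBER, "123.34", FEEDBACK)
without_integral = fill_feedback(NUMBER, ".5", FEEDBACK)
```

For the response ".5" the text "Integral part is" stays in the feedback with nothing after it, which is the limitation described above.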

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that matches when something is absent from the response (and gives appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
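
The missing-word pattern can be tried out with Python's <code>re</code> module, used here only as a stand-in for the Preg engines. The helper name <code>word_is_missing</code> is made up for this sketch.

```python
import re

# The pattern from the text: anchored start plus a negative lookahead.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True when the word 'necessary' does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None

present = word_is_missing("It is necessary to leave")
absent = word_is_missing("It is required to leave")
# "unnecessary" does not count: \b demands a word boundary before "necessary".
inside_other_word = word_is_missing("It is unnecessary to leave")
```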

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s+)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words. If you want to catch the first (last) word (punctuation mark, etc), place the simple assertions for the start/end of the string ("^" or "$") instead of words in the corresponding assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note that if you have several answers to catch missing and misplaced things, only one of them will actually work for any given student response.

Since the Preg 2.3 release you can combine hints with catching missing words. But make sure that the answers that look for missing things (and other answers that only give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't describe a correct response, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of the regular expression;
# '''explaining graph''' - shows graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You can also select a part of the regular expression, and it will be mapped onto the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression: a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters; its edges show paths through the nodes from beginning to end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO A green rectangle shows the selected part of the expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You can enter a set of strings there to match against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in the database when saving the question, if you save the regex (they will be lost if you close the window with the "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all mastered arithmetic expressions since childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you arrange a sequence of '''operators''' with the correct number of '''operands''' for each (arity), paying attention to their evaluation order through '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use different operands and operators - but they are governed by the same rules of precedence and arity.
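
If you are curious how a program sees operators, operands and precedence, the arithmetic examples above can be inspected with Python's standard <code>ast</code> module. This is just an illustration of the idea of an expression tree; it has nothing to do with how Preg parses regexes, and the helper names are made up for this sketch.

```python
import ast

def top_operator(expression):
    """Name of the operator that is evaluated last (the root of the tree)."""
    tree = ast.parse(expression, mode="eval").body
    return type(tree.op).__name__

def operands_of_top_operator(expression):
    """Source text of the left and right operands of a binary root operator."""
    tree = ast.parse(expression, mode="eval").body
    return ast.unparse(tree.left), ast.unparse(tree.right)

# '*' has precedence over '+', so '+' is the root and 'y*2' is its operand.
without_parentheses = operands_of_top_operator("x+y*2")
# Parentheses change the order: now '*' is the root and 'x+y' is its operand.
with_parentheses = operands_of_top_operator("(x+y)*2")
# The unary minus takes a single operand.
unary = top_operator("-x")
```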

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash.

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

TODO - write more about operands and operators, but simple.

===Precedence and order of evaluation===
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
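
These precedence rules are easy to check by experiment. The sketch below uses Python's <code>re</code> module, which follows the same precedence for these operators; <code>fullmatch</code> requires the whole string to match, and the helper name <code>matches</code> is just for this example.

```python
import re

def matches(regex, text):
    """True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: '*' repeats only "b" unless parenthesized.
quantifier_cases = [
    matches("ab*", "abbb"),    # "a" followed by several "b"
    matches("ab*", "abab"),    # not a repetition of "ab"
    matches("(ab)*", "abab"),  # parentheses make "ab" the repeated operand
]

# Quantifier over alternative: repeating a choice versus choosing a repetition.
alternative_cases = [
    matches("(a|b){2}", "ab"),   # any two characters out of "a" and "b"
    matches("a{2}|b{2}", "ab"),  # only "aa" or "bb" are allowed
    matches("ab|cd*", "cddd"),   # '*' applies to "d" only
]
```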

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with '''^''', the match should start at the start of the student's response;
* if a regular expression ends with '''$''', the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
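
The difference between these anchored forms can be seen in a short Python experiment. Python's <code>re.search</code> is used here as a stand-in for non-exact matching in Preg, and the helper name <code>found</code> is an assumption of this sketch.

```python
import re

def found(regex, response):
    """True when the regex matches somewhere in the response."""
    return re.search(regex, response) is not None

# "^ab|cd$": each anchor belongs to one alternative only.
loose = [
    found(r"^ab|cd$", "abxyz"),  # starts with "ab"
    found(r"^ab|cd$", "xyzcd"),  # ends with "cd"
    found(r"^ab|cd$", "xabx"),   # neither
]

# "^(ab|cd)$" and "^ab$|^cd$": both anchors apply to every alternative.
strict = [
    found(r"^(ab|cd)$", "abxyz"),
    found(r"^(ab|cd)$", "cd"),
    found(r"^ab$|^cd$", "xyzcd"),
    found(r"^ab$|^cd$", "ab"),
]
```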

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
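
Character classes and word boundaries from the list above can be explored with a small Python sketch. Python's <code>re</code> module is used as a stand-in; it treats these particular constructs the same way, but it does not support "\Q ... \E", unicode properties or POSIX classes. The helper name <code>accepted</code> is made up for this example.

```python
import re

def accepted(char_class, candidates):
    """Return the candidate characters matched by the character class."""
    return [c for c in candidates if re.fullmatch(char_class, c)]

# Ranges "a-s", "C-F", "0-9" plus the single character "z".
ranges = accepted(r"[a-szC-F0-9]", "stzBD5")
# Negative set: anything except "a".."z" and the hyphen.
negative = accepted(r"[^a-z-]", "a-A9")
# Escaping inside a set: hyphen, closing bracket and backslash.
escaped = accepted(r"[\-\]\\]", "-]\\a")

# "\b" is an assertion: it finds "am" only as a separate word.
separate_words = re.findall(r"\bam\b", "I am a mammal")
```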

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a small example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
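
Both backreference examples can be checked with Python's <code>re</code> module, which uses the same "\1" syntax for numbered backreferences. The helper name <code>declared_and_initialized</code> is hypothetical and only serves this sketch.

```python
import re

# The backreference must repeat exactly what the 1st subpattern captured.
DOUBLE = re.compile(r"([ab])\1")

# The whitespace-tolerant declaration regex from the text.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

def declared_and_initialized(code):
    """Return the variable name if the same variable is declared and set."""
    match = DECLARATION.fullmatch(code)
    return match.group(1) if match else None

same_name = declared_and_initialized(" int var2 ; var2=123 ; ")
other_name = declared_and_initialized("int a; b = 1;")
```

The second call returns nothing because "\1" must match "a", the name captured by the first subpattern, and finds "b" instead.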

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns: the names stay the same.
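
Here is a sketch of a named subpattern with Python's <code>re</code> module. Note the assumption: of the forms listed above, Python's <code>re</code> accepts only "(?P<name>...)" and "(?P=name)", so the example is limited to them; the other spellings are PCRE syntax. The regex is a simplified variant of the C declaration example.

```python
import re

# A named subpattern "var" and a backreference to it by name.
NAMED_DECLARATION = re.compile(
    r"int (?P<var>[_a-zA-Z]\w*); (?P=var) = -?\d+;"
)

def variable_name(code):
    """Return the captured variable name, looked up by subpattern name."""
    match = NAMED_DECLARATION.fullmatch(code)
    return match.group("var") if match else None

name = variable_name("int _var; _var = -10;")
mismatch = variable_name("int a; b = 1;")
```

Adding another pair of parentheses before "(?P<var>...)" would change its number, but the name "var" would keep working.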

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

===Complex assertions===
Assertions about some part of the string don't actually go into the matched text, but affect whether and where a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
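
The four assertions can be compared side by side in Python, whose <code>re</code> module supports the same syntax for these simple cases. The helper name <code>first_match</code> is an assumption of this sketch.

```python
import re

def first_match(regex, text):
    """Return the text of the first match, or None when nothing matches."""
    match = re.search(regex, text)
    return match.group(0) if match else None

# Positive lookahead: "b" must follow but is not part of the match.
followed = first_match(r"a+(?=b)", "aaab")
not_followed_by_b = first_match(r"a+(?=b)", "aaac")

# Negative lookahead: the engine gives back one "a" so that the
# character after the match is "a" rather than "b".
negative_ahead = first_match(r"a+(?!b)", "aaab")

# Positive lookbehind: "b" must precede but is not part of the match.
preceded = first_match(r"(?<=b)a+", "baaa")

# Negative lookbehind: the match cannot start right after "b",
# so it starts one character later.
negative_behind = first_match(r"(?<!b)a+", "baaa")
```

The two negative cases show that such assertions can shorten or shift a match instead of making it fail.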

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This affects the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
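
To get a feeling for what a parser that reports errors tells you, here is a sketch using Python's <code>re</code> module. This is not the Preg parser: the messages, and even the set of patterns considered erroneous, may differ between Python and the NFA/DFA engines. The helper name <code>regex_error</code> is made up for this example.

```python
import re

def regex_error(pattern):
    """Return the parser's error message, or None for a valid pattern."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None

# A few of the error classes listed above, as Python's parser sees them.
broken_patterns = [
    "abc)",        # closing parenthesis that was never opened
    "(?:qwerty",   # opening parenthesis that was never closed
    "+",           # quantifier without an operand
    "[a-fA-F\\d",  # unclosed character class
    "[z-a]",       # incorrect character set range
    "(abc)\\2",    # reference to a nonexistent subpattern
]
messages = {pattern: regex_error(pattern) for pattern in broken_patterns}
valid = regex_error("(ab)*")
```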

==The ways to give back==
This software is considered a scientific project and as such needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey with your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article, or help in publishing it in English-language journals (information and help with grants for further work is welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, during cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has much to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-22T21:18:47Z

Oasychev: /* How question work */

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The first section should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
Regardless of how you use the Preg question type and of your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any sort of feedback on freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or without giving some encouragement or easy support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type there may help a lot.

===I don't (want) know anything about regular expressions but next word(character) hinting seems useful===
You can use the Preg question type just like Shortanswer questions with advanced hinting, without any knowledge of regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#How question work|how the question works]] to understand more about the hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you can make heavy use of the authoring tools to create your questions. The authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with changing your regexes, watch the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How question work|how the question works]] to better understand the various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything I they could===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily; if you want to know what you are doing when writing your regexes instead of blindly trying, you should spend some time and effort on understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section on [[#Understanding regular expressions|understanding regular expressions]]. Then skim the section on [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can check whether you really understand regexes well and whether they behave as expected. The syntax tree may be of special use to you when you are trying to get the meaning of ''precedence'' and ''arity'' right. Once you understand the principles of regular expressions well, read the sections on [[#How question work|question working]] and [[#Regular expressions reference|regular expression reference]] (to learn what is possible; don't bother to understand or remember everything - just look there periodically for something new to learn). Now you should be able to write your regexes without much use of the authoring tools, except the testing tool to test your regexes.

===I know regular expressions well enought to write them on my own without further guidance===
You should read about [[#How question work|question working]] to understand the various settings and the question behaviour under them. You may also be interested in regex testing, described in the [[#Authoring tools|authoring tools]] section. Finally, the [[#Regular expressions reference|regular expression reference]] may be of some use to you.

==How question work==
Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you can use them in almost any combination):
* pattern matching - using regular expressions you can create powerful patterns describing possible student answers;
* hinting - when your students are stuck on the question, you may allow them to ask for the next correct word (lexem) or character (with a penalty if you wish).

===Settings, that affects how question will work===
====Case sensitivity====
You should know this setting from the core Shortanswer question type. Note, however, that you can [[#Local case-sensitivity modifiers|change case sensitivity inside your regular expressions]], making only parts of them case sensitive.

====Exact matching====
'''Matching''' means finding a part of the student's answer that suits the regular expression (or your answer). This part is called a '''match'''. Traditionally, regular expressions were used to look for matches '''inside''' strings, i.e. '''all''' of the ''regular expression'' should match, but it could match with a '''part of''' the ''student's response''. If you want such behaviour, you may set "Exact matching" to "No". You can still make some of your answers match the whole student's response using [[#Anchoring|regular expression features]].

If you want your answers to match only the whole student response, and don't know all the regex mumbo-jumbo needed to do it, you may just set "Exact matching" to "Yes", and the question will do all the work for you. That is the usual way of creating a question, so "Yes" is the default.
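
The difference between the two settings can be sketched with Python's re module, used here only as a stand-in for the question's own engines (an assumption; the sample answer and responses are made up for illustration). With "Exact matching" set to "No" the regex may match any part of the response, which roughly corresponds to re.search; with "Yes" it must cover the whole response, which roughly corresponds to re.fullmatch.

```python
import re

answer = r"blue, white(,| and) red"          # hypothetical teacher answer
response = "they are blue, white and red"    # hypothetical student response

# "Exact matching" = No: a match anywhere inside the response is enough.
inside = re.search(answer, response)

# "Exact matching" = Yes: the regex must cover the whole response.
whole = re.fullmatch(answer, response)
whole_ok = re.fullmatch(answer, "blue, white, red")

print(inside.group(0))       # the part of the response that matched
print(whole)                 # None, because of the extra text in front
print(whole_ok is not None)
```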

====Notations====
Notation is the way you write your regexes. Alternatively, choose the "Moodle shortanswer" notation to avoid regexes altogether and still use the hinting features.
* The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. You may write a regex on multiple lines for better readability - line breaks will be ignored.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so that you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything said about regular expressions there, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
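
As a rough illustration of what the extended notation buys you, here is the same regex written compactly and in a commented, spaced-out form. The sketch uses Python's re.VERBOSE flag, which behaves much like the 'x' modifier described above; it is only an analogy and not the Preg code itself, and the sample strings are made up.

```python
import re

# Compact form: hard to read once the regex grows.
compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

# Extended form: whitespace is ignored and "#" starts a comment.
extended = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point (escaped dot)
    ([0-9]+)      # fractional part
""", re.VERBOSE)

samples = ["123.34", "-.5", "12"]
for s in samples:
    print(s, bool(compact.fullmatch(s)), bool(extended.fullmatch(s)))
```

Both forms accept and reject exactly the same strings here; only the layout differs.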

===Matching engines===
A matching engine is the program code that performs regular expression execution. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.

====PHP preg extension====
Use it when you don't need hinting and the other engines reject your expressions as too difficult, or when you encounter bugs in them.

It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

====Non-deterministing finite state automata(NFA)====
Use the NFA engine to perform hinting for your students if it can handle your regular expressions.

The NFA engine is custom PHP code that uses finite automata to perform matching. It supports many (but not all) regular expression features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which means quite a lot), but may still contain bugs in rare cases.

Features not supported for now include:
* complex assertions;
* recursion;
* conditional subpatterns.

====Deterministic finite state automata (DFA)====
WARNING: This engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject regexes that this engine accepts). It allows almost anything the DFA engine can, and the NFA engine is much better tested and more stable.

===Hinting===
The NFA and DFA matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive and interactive behaviours.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts matching and then breaks at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored ("Exact matching" is set to "No"), so the match doesn't have to begin at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
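
To make the idea of a partial match concrete, here is a deliberately simplified sketch. It is not how the NFA or DFA engine works (they walk an automaton); it simply lists the two complete strings that the sample expression describes and looks for the longest piece of the response that begins one of them. The function name partial_match is hypothetical.

```python
def partial_match(candidates, response):
    """Return the longest piece of `response` that is a prefix of some candidate."""
    best = ""
    for start in range(len(response)):
        for cand in candidates:
            n = 0
            # Count how many characters agree, starting at `start`.
            while (n < len(cand) and start + n < len(response)
                   and cand[n] == response[start + n]):
                n += 1
            if n > len(best):
                best = response[start:start + n]
    return best

# The two complete strings described by "are blue, white(,| and) red".
candidates = ["are blue, white, red", "are blue, white and red"]
matched = partial_match(candidates, "they are blue, vhite and red")
print(repr(matched))
```

The match stops right before the wrong character "v", which is where the incorrect part of the response begins.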

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to repeatedly press '''hint''', which is not usually a desirable behaviour.
When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
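
The choice between the two hint characters can be sketched as follows. This is an illustration under the assumption that all complete matches can be listed explicitly (true for this small expression, not in general); the real engines find the shortest path inside the automaton. The function name next_char_hint is hypothetical.

```python
def next_char_hint(candidates, matched_prefix):
    """Pick the next character on the shortest way to a complete match."""
    options = [c for c in candidates
               if c.startswith(matched_prefix) and len(c) > len(matched_prefix)]
    if not options:
        return None
    shortest = min(options, key=len)      # fewest characters left to type
    return shortest[len(matched_prefix)]

# The two complete strings described by "are blue, white(,| and) red".
candidates = ["are blue, white, red", "are blue, white and red"]
hint = next_char_hint(candidates, "are blue, white")
extra = len(candidates[1]) - len(candidates[0])
print(repr(hint), extra)
```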

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those from 0% to 49% will not. Regular expressions not used for hinting work only when they have a full match with the student response.
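
The rule is a simple threshold, which the following sketch spells out. The data layout (pairs of regex and grade) and the function name are assumptions made for the example, not the plugin's internal structures.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

answers = [
    ("are blue, white(,| and) red", 1.0),   # full-credit answer
    ("are white(,| and) red", 0.6),         # answer that forgot "blue"
]

print(answers_used_for_hinting(answers, 1.0))   # only the 100% answer
print(answers_used_for_hinting(answers, 0.5))   # both answers
```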

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button. Pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

====Next lexem (word) hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white
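
A minimal sketch of the idea, assuming a very simple scanner in which a lexem is either a run of word characters or a run of punctuation. The real scanners of the formal languages block are more elaborate and are not reproduced here; the function name next_lexeme_hint and the sample target string are made up for illustration.

```python
import re

# Assumed "simple" scanner: words/numbers, or groups of punctuation marks.
LEXEME = re.compile(r"\w+|[^\w\s]+")

def next_lexeme_hint(target, matched_len):
    """Extend the matched part of `target` to the end of the next unfinished lexem."""
    for lex in LEXEME.finditer(target):
        if lex.end() > matched_len:
            return target[:lex.end()]
    return target

target = "are blue, white and red"
print(next_lexeme_hint(target, 5))    # match ended inside "blue"
print(next_lexeme_hint(target, 8))    # "blue" complete, next lexem is ","
print(next_lexeme_hint(target, 10))   # next lexem is "white"
```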

Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (but there will be more):
* '''simple english''' - the English language scanner recognizes words, numbers and punctuation marks;
* '''C/C++ language''' - the C (or C++) programming language;
* '''printf language''' - a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' will be replaced by the whole match, '''{$1}''' by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
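
The placeholder substitution can be imitated in a few lines. This is only a sketch of the behaviour described above, written with Python's re module rather than the Preg code; the function name fill_feedback is hypothetical, and a placeholder number higher than the number of subpatterns would raise an error here.

```python
import re

def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    m = re.fullmatch(regex, response)
    if m is None:
        return None
    def value(placeholder):
        # A subpattern that took no part in the match becomes an empty string.
        return m.group(int(placeholder.group(1))) or ""
    return re.sub(r"\{\$(\d+)\}", value, feedback)

regex = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(regex, "123.34", feedback))
print(fill_feedback(regex, ".5", feedback))
```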

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
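
The behaviour of the missing-word expression can be checked quickly. The sketch below uses Python's re module as a stand-in for a PCRE-style engine; the sample responses and the function name word_is_missing are made up.

```python
import re

# Matches (at the start of the response) only if "necessary" never appears as a word.
missing_necessary = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    return missing_necessary.search(response) is not None

print(word_is_missing("it is necessary to test"))   # word present
print(word_is_missing("it is needed to test"))      # word absent
print(word_is_missing("unnecessary work"))          # only part of another word
```

The last case shows the role of "\b": "unnecessary" does not count as the word "necessary".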

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s+)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words. If you want to catch the first (last) word (punctuation mark, etc.), you should place the simple assertions for start/end of string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by start of the string and not followed by "am".
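
The second expression can be tried out as follows, with the negative lookbehind spelled "(?<!^)" as in standard Perl-compatible syntax. Python's re module is used as a stand-in for the question's engines, and the sample responses are made up.

```python
import re

# "I" that is not at the very start and is not followed by "am".
misplaced_i = re.compile(r"(?<!^)\bI\b(?!\s+am)")

correct = misplaced_i.search("I am victor")      # "I" is where it should be
misplaced = misplaced_i.search("am I victor")    # "I" moved to the middle

print(correct)
print(misplaced.group(0), misplaced.start())
```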

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of a regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember part of the match, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there to match against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash.

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

TODO - write more about operands and operators, but simple.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
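
These precedence rules are easy to verify by experiment. The sketch below uses Python's re module, which follows the same precedence rules for these basic operators; the helper name accepts is made up, and re.fullmatch is used so that each regex must cover the whole string.

```python
import re

def accepts(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation.
print(accepts(r"ab*", "abbb"), accepts(r"ab*", "abab"))      # True False
print(accepts(r"(ab)*", "abab"))                             # True

# Quantifier over alternative.
print(accepts(r"ab|cd*", "cddd"), accepts(r"ab|cd*", "abd")) # True False
print(accepts(r"(ab|cd)*", "ababcdabcdcd"))                  # True

# Repeating an alternative versus alternatives of repetitions.
print(accepts(r"(a|b){2}", "ab"), accepts(r"a{2}|b{2}", "ab"))  # True False
```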

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
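
The three variants can be compared on a few made-up strings. Python's re.search is used here to imitate unanchored matching; the variable names are chosen for the example only.

```python
import re

loose = re.compile(r"^ab|cd$")          # "^" belongs to "ab", "$" to "cd"
strict = re.compile(r"^(ab|cd)$")       # both anchors apply to both alternatives
also_strict = re.compile(r"^ab$|^cd$")  # every alternative anchored by hand

for text in ["ab", "cd", "abxyz", "xyzcd"]:
    print(text,
          loose.search(text) is not None,
          strict.search(text) is not None,
          also_strict.search(text) is not None)
```

Only the first expression accepts "abxyz" and "xyzcd"; the other two accept just "ab" and "cd".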

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a literal dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches where there is no word boundary, the opposite of "\b".
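
Several of the operands above can be tried out with Python's re module, used here as a rough stand-in for a Perl-compatible engine. Note the assumption: Python's re does not support "\Q...\E", "\p{xx}" or POSIX classes, so those are left out of the sketch. The helper name accepts and the sample strings are made up.

```python
import re

def accepts(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Escaped special characters match themselves.
print(accepts(r"a\*b\[", "a*b["), accepts(r"\\", "\\"))

# Character classes: ranges, negation, escaping inside the class.
print(accepts(r"[a-szC-F0-9]", "z"), accepts(r"[a-szC-F0-9]", "t"))
print(accepts(r"[^a-z-]", "A"), accepts(r"[^a-z-]", "-"))
print(accepts(r"[\-\]\\]", "]"))

# Escape sequences and the dot.
print(accepts(r"\w\s\d.", "a 1!"))

# "\b" is a condition, not a character: it only checks the position.
print(re.search(r"\bcat\b", "a cat sat") is not None,
      re.search(r"\bcat\b", "concatenate") is not None)
```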

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be written. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
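
A short experiment with bounded quantifiers and the empty alternative, again using Python's re module as a stand-in for the question's engines (the helper name accepts is made up):

```python
import re

def accepts(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# "x{2,4}" accepts two to four repetitions only.
print([n for n in range(7) if accepts(r"x{2,4}", "x" * n)])

# "x?" accepts the empty string and a single "x".
print(accepts(r"x?", ""), accepts(r"x?", "x"), accepts(r"x?", "xx"))

# The empty alternative makes the whole group optional.
print(accepts(r"(aa|bb|)c", "c"), accepts(r"(aa|bb|)c", "aac"),
      accepts(r"(aa|bb|)c", "abc"))
```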

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''', which have numbers too. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for a backreference is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference must match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
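
Both backreference examples can be checked with Python's re module (a stand-in for the question's engines; the variable names and the last sample string are made up):

```python
import re

pair = re.compile(r"([ab])\1")
declaration = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

# The backreference must repeat exactly what subpattern 1 captured.
print(pair.fullmatch("aa") is not None, pair.fullmatch("ab") is not None)

ok = declaration.fullmatch(" int var2 ; var2=123 ; ")
print(ok.group(1))        # the variable name captured by subpattern 1

# A different name in the assignment breaks the match.
print(declaration.fullmatch("int a; b = 1;"))
```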

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
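
A small sketch of named subpatterns, with one caveat: Python's re module, used here as a stand-in, understands only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the other spellings are not shown. The subpattern name letter is made up.

```python
import re

# Same idea as "([ab])\1", but the subpattern is referred to by name.
named_pair = re.compile(r"(?P<letter>[ab])(?P=letter)")

m = named_pair.fullmatch("bb")
print(m.group("letter"))                  # captured text, looked up by name
print(named_pair.fullmatch("ab"))         # None: the two letters differ

# Adding an unrelated subpattern in front does not change the name.
longer = re.compile(r"(x)?(?P<letter>[ab])(?P=letter)")
print(longer.fullmatch("xaa").group("letter"))
```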

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

===Complex assertions===
Assertions check some part of the string without including it in the matched text, but they still affect whether and where a match is found:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
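
The four assertions can be compared on one string. The following Python sketch is only an illustration (the sample text and the names TEXT and RESULTS are assumptions); Python's "re" module uses the same syntax for these assertions.

```python
import re

TEXT = "aab aac baa"

# Each entry lists every match of the regex in TEXT.
RESULTS = {
    "positive lookahead":  re.findall(r"a+(?=b)", TEXT),
    "negative lookahead":  re.findall(r"a+(?!b)", TEXT),
    "positive lookbehind": re.findall(r"(?<=b)a+", TEXT),
    "negative lookbehind": re.findall(r"(?<!b)a+", TEXT),
}

for name, found in RESULTS.items():
    print(name, found)
```

Note that the negative lookahead still finds a single "a" at the start of "aab": the engine backtracks to a shorter run of "a" that is followed by another "a" rather than by "b".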

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and add a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match, but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
The native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This software is considered a scientific project and as such needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey among your teachers and/or students about it - but it can also include co-authoring a conference thesis or journal article;
* cooperation in writing an article or help in publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about the details. You may receive individual help during such work too (for example, when doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise the publishing of a paper (or at least a thesis) with me as a co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be better able to determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-22T20:35:20Z

Oasychev:

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The first section should guide you in using this documentation; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, and I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
Regardless of the way you use the Preg question type and of your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any sort of feedback with freely distributed software. But you shouldn't expect to get software that suits your needs ideally without telling anyone about these needs, or without giving the authors some encouragement or some non-difficult support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type there may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
You can use Preg question type just as Shortanswer questions with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#How question work|question working]] to understand more about the hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you may make heavy use of the authoring tools to create your questions. The authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with changing your regexes, see the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How question work|question working]] to better understand the various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything they can===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily; if you want to know what you are doing when writing your regexes, instead of trying blindly, you should spend some time and effort on understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section about [[#Understanding regular expressions|understanding regular expressions]]. Then read a little about the [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be of special use to you when you try to get the meaning of ''precedence'' and ''arity'' right. Once you understand the principles of regular expressions well, read the sections about [[#How question work|question working]] and the [[#Regular expressions reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write your regexes without much use of the authoring tools, except the testing tool to test your regexes.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How question work|question working]] to understand the various settings and the question's behaviour under them. You may also be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, the [[#Regular expressions reference|regular expression reference]] may be of some use to you.

==How question work==
===Matching===
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one regex with a 100% grade) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and give the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

Basically, this question type is an extended version of the Shortanswer.
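
The "first full match gives the grade" rule can be pictured with a short Python sketch. It is a simplified illustration, not the actual Preg code: the ANSWERS list, the grade_response function and its parameters are hypothetical, and partial matching is left out completely.

```python
import re

# Hypothetical answers: (regex, grade) pairs in the order the teacher entered them.
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def grade_response(response, answers=ANSWERS, case_sensitive=False):
    """Return the grade of the first regex that matches, or 0.0 if none does."""
    flags = 0 if case_sensitive else re.IGNORECASE  # plays the role of the "i" modifier
    for pattern, grade in answers:
        if re.search(pattern, response, flags):     # unanchored search
            return grade
    return 0.0

print(grade_response("they are blue, white and red"))
print(grade_response("are white, red"))
```

The order of the answers matters in this sketch: a response that matches several regexes gets the grade of the first one.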

===Notations===
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which the regexes for answers are written. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, with access to hinting and without any need to understand regular expressions!
* The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything that is said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
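
To give a feeling for what the extended notation looks like, here is a sketch using Python's "re.VERBOSE" flag, which behaves in a similar way to the 'x' modifier (whitespace is ignored, "#" starts a comment). This only illustrates the layout; the names NUMBER and COMPACT are made up, and Preg's own handling of the notation may differ in details.

```python
import re

# A regex for a decimal number, spread over several commented lines.
NUMBER = re.compile(r"""
    [+\-]?       # optional sign
    ([0-9]+)?    # optional integral part
    \.           # decimal point
    ([0-9]+)     # fractional part
""", re.VERBOSE)

# The same regex in the usual compact form.
COMPACT = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

for text in ["123.34", "-.5", "12."]:
    print(text, bool(NUMBER.fullmatch(text)), bool(COMPACT.fullmatch(text)))
```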

===Hinting===
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks off at some character. Say you entered the expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to begin at the start of the student's response. When only partial matching is used, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

====Next lexem hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

The Preg question type, since the 2.3 release, allows the use of next lexem hinting by means of the ''formal languages block''. You should choose the language in which you expect the response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

It is possible that not all regular expressions give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
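
The effect of the hint grade border is essentially a filter on the list of answers. A minimal Python sketch of this idea follows; the function name regexes_used_for_hinting and the (regex, grade) data layout are assumptions made for illustration, not the question type's real code.

```python
def regexes_used_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

# (regex, grade) pairs for the two example answers from the text.
answers = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

print(regexes_used_for_hinting(answers, 1))    # only the 100% answer
print(regexes_used_for_hinting(answers, 0.5))  # both answers
```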

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening parentheses. Precisely, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to remove "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
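
The placeholder substitution described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the real implementation: the names PLACEHOLDER and fill_feedback are hypothetical, and leaving placeholders with an unknown subpattern number untouched is a choice of this sketch.

```python
import re

PLACEHOLDER = re.compile(r"\{\$(\d+)\}")

def fill_feedback(feedback, regex, response):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    match = re.search(regex, response)
    if match is None:
        return feedback

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return placeholder.group(0)    # unknown subpattern: leave as is
        return match.group(number) or ""   # unmatched subpattern: empty string

    return PLACEHOLDER.sub(substitute, feedback)

FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(FEEDBACK, r"[+\-]?([0-9]+)?\.([0-9]+)", "123.34"))
```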

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regex for the thing that should be missing with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
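
The missing-word regex can be checked with a small Python sketch (Python's "re" module supports the same negative lookahead syntax). The names MISSING_NECESSARY and word_is_missing are hypothetical and exist only for this illustration.

```python
import re

# Matches (at the start of the response) only when the word "necessary"
# does not occur anywhere later in the response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True if the response does not contain the word "necessary"."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("this step is needed"))
print(word_is_missing("this step is necessary here"))
```

Because of the "\b" assertions, a response containing only "unnecessary" still counts as missing the word.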

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because lookbehind assertions in PCRE generally have to be of fixed length. If you want to catch the first (last) word (punctuation mark, etc.), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
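
Expressions of this kind can be tried in Python, as in the sketch below. It writes the negative lookbehind as "(?<!...)" and uses a single "\s" inside it, since Python's "re" module only accepts lookbehind of fixed length. The names MISPLACED_AM, MISPLACED_I and is_misplaced are assumptions made for this example.

```python
import re

# "am" that is neither preceded by "I " nor followed by "victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def is_misplaced(pattern, response):
    """True if the pattern finds a misplaced word in the response."""
    return pattern.search(response) is not None

for response in ["I am victor", "am I victor"]:
    print(response, is_misplaced(MISPLACED_AM, response), is_misplaced(MISPLACED_I, response))
```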

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

The NFA and DFA matchers for now don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that exist only to give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so this is no real restriction.

===Matching engines===
A matching engine is the program code that performs matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.

====Deterministic finite state automata (DFA)====
WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. if your regexes don't get rejected); it allows almost anything the DFA engine can do, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP regex operands and operators are supported, and for some (more exotic) ones the support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

====Non-deterministic finite state automata (NFA)====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of a regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match against your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
A regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters, and its edges show the paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, it remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember a part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how your expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each (arity), controlling their evaluation order by means of '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - that is, preceded by a backslash.

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

TODO - write more about operands and operators, but simple.

===Precedence and order of evaluation===
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
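
The precedence rules above can be checked with a short script. This is only a sketch: it uses Python's <code>re</code> module as a stand-in for the PCRE dialect used by the question (the two agree on these basic operators), and the helper name <code>full</code> is made up for the illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE here; the basic
# operators (concatenation, alternative, quantifiers) behave the same.

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: "*" repeats only "b" unless parenthesized.
print(full(r"ab*", "abbb"))
print(full(r"ab*", "abab"))
print(full(r"(ab)*", "abab"))

# Quantifier over alternative: "*" repeats only "d" unless parenthesized.
print(full(r"ab|cd*", "cddd"))
print(full(r"(ab|cd)*", "ababcdabcdcd"))

# A quantifier repeats the whole alternative, not one chosen branch.
print(full(r"(a|b){2}", "ab"))
print(full(r"a{2}|b{2}", "ab"))
```

Comparing the parenthesized and unparenthesized variants side by side is usually the quickest way to see which operand a quantifier is attached to.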

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
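
The difference between the three anchored forms can be tried out as follows. This is a sketch using Python's <code>re</code> module as a stand-in for PCRE; the helper name <code>found</code> is hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE; "^", "$" and "|"
# interact the same way in both for single-line responses.

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# "^" belongs only to the first alternative and "$" only to the last one.
print(found(r"^ab|cd$", "abxyz"))
print(found(r"^ab|cd$", "xyzcd"))

# Parentheses make both anchors apply to every alternative.
print(found(r"^(ab|cd)$", "abxyz"))
print(found(r"^(ab|cd)$", "cd"))

# Anchoring each top-level alternative separately gives the same result.
print(found(r"^ab$|^cd$", "ab"))
print(found(r"^ab$|^cd$", "abcd"))
```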

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* the full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters which could be used both inside or outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
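
A few of the operands above can be tried out with the sketch below. It uses Python's <code>re</code> module as a stand-in for PCRE, so it covers only the operands both dialects share: "\Q ... \E", "\p{xx}" and POSIX classes are not accepted by Python's <code>re</code> and are left out. The helper name <code>matches</code> is made up for the illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE. It does not support
# \Q...\E, \p{..} or [[:alnum:]], so only the shared operands are shown.

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Escaped special characters match themselves.
print(matches(r"a\*b\[", "a*b["))

# Ranges "a-s", "C-F", "0-9" mixed with the single character "z".
print([c for c in "szDt5" if matches(r"[a-szC-F0-9]", c)])

# Negative set: anything except "a".."z" and "-".
print([c for c in "q-Q" if matches(r"[^a-z-]", c)])

# Escaping inside a set: "-", "]" and "\" are the members.
print([c for c in "-]\\a" if matches(r"[\-\]\\]", c)])

# "\b" only tests a position, so "concat" does not contain the word "cat".
print(re.findall(r"\bcat\b", "cat concat cat."))
```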

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
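
The backreference examples above can be run as written. A minimal sketch, assuming Python's <code>re</code> module as a stand-in for PCRE (numbered subpatterns and "\1" work the same way in both); the names <code>DECLARATION</code> and <code>declared_variable</code> are hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE; numbered subpatterns
# and "\1" backreferences behave the same in both.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

def declared_variable(code):
    """Return the variable name if the same name is declared and assigned."""
    match = DECLARATION.fullmatch(code)
    return match.group(1) if match else None

print(declared_variable(" int var2 ; var2=123 ; "))
print(declared_variable("int _var; _var = -10;"))
# Different names: the backreference cannot match "b" after capturing "a".
print(declared_variable("int a; b = 1;"))

# The short example from the text: the second character must repeat the first.
print(re.fullmatch(r"([ab])\1", "aa") is not None)
print(re.fullmatch(r"([ab])\1", "ab") is not None)
```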

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
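
Named subpatterns and named backreferences can be sketched like this. Python's <code>re</code> module is used as a stand-in for PCRE, and of the spellings listed above it accepts only "(?P<name>...)" and "(?P=name)", so those are the ones shown; the regex and the names <code>QUOTED</code> and <code>quoted_body</code> are made up for the illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE. Of the spellings in
# the list above it accepts only (?P<name>...) and (?P=name).
QUOTED = re.compile(r"(?P<quote>['\"])(?P<body>[^'\"]*)(?P=quote)")

def quoted_body(text):
    """Return the text between two identical quotes, or None otherwise."""
    match = QUOTED.fullmatch(text)
    return match.group("body") if match else None

print(quoted_body("'abc'"))
print(quoted_body('"abc"'))
# Opening and closing quotes differ, so the named backreference fails.
print(quoted_body("'abc\""))
```

Adding another group in front of "quote" would shift its number but not its name, which is the benefit described above.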

====Duplicate subpattern numbers and names====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

===Complex assertions===
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
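
The four assertions can be observed with the sketch below, which again uses Python's <code>re</code> module as a stand-in for PCRE (the lookaround syntax is the same). The helper name <code>first_match</code> is hypothetical. Note how the negative forms can make the engine settle for a shorter run of "a".

```python
import re

# Assumption: Python's re module stands in for PCRE; the lookahead and
# lookbehind syntax shown in the list is shared by both.

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Positive lookahead: "b" must follow but is not part of the match.
print(first_match(r"a+(?=b)", "caaab"))

# Negative lookahead: "aaa" is followed by "b", so the engine backs off
# to "aa", which is followed by "a".
print(first_match(r"a+(?!b)", "aaab"))

# Positive lookbehind: only the run of "a" right after "b" qualifies.
print(first_match(r"(?<=b)a+", "aabaa"))

# Negative lookbehind: the first "a" comes right after "b", so the match
# starts one character later.
print(first_match(r"(?<!b)a+", "baa"))
```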

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

==The ways to give back==
This software is considered a scientific project and as such needs the help any scientific project can use:
* evidence that the results of our work (i.e. Preg question type) are really useful to people and were used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article or help in publishing it in English-language journals (information and help with grants for further work is welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has much to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regex can be quite complex, so it's hard to do all testing by developers themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-22T20:32:58Z

Oasychev: Major rearranging of material in the new sections, links to sections added to "how to use" section

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check student's responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. The first section should guide you in using these docs, please use it at your discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and this docs==
Regardless of the way you use the Preg question and your capabilities, you can always [[#The ways to give back|give back]]. It is hard to get any sort of feedback with freely distributed software. But you shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or without giving some encouragement or easy support to the authors. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type there may help a lot.

===I don't (want to) know anything about regular expressions but next word (character) hinting seems useful===
You can use Preg question type just as Shortanswer questions with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata
* '''Exact matching''' => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about [[#How question work|question working]] to understand more about hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you may make heavy use of authoring tools to create your questions. Authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with changing your regexes, see the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on [[#Authoring tools|authoring tools]], then (probably after some experimenting with the tools on your own) the start of the section about [[#Understanding regular expressions|understanding regular expressions]] (this is optional, but may be interesting and help a lot). You should also read the section about [[#How question work|question working]] to better understand various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything they could===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily; if you want to know what you are doing when writing your regexes, instead of blind trial and error, you should spend some time and effort understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section about [[#Understanding regular expressions|understanding regular expressions]]. Then read a little about [[#Authoring tools|authoring tools]] and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes well and whether they behave as expected. The syntax tree may be of special use to you when you try to get the right meaning of ''precedence'' and ''arity''. After you understand the principles of regular expressions well, read the sections about [[#How question work|question working]] and [[#Regular expression reference|regular expression reference]] (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write your regexes without much use of authoring tools, except the testing tool to test your regexes.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about [[#How question work|question working]] to understand various settings and question behaviour under them. You may also be interested in regex testing in the [[#Authoring tools|authoring tools]] section. Finally, the [[#Regular expression reference|regular expression reference]] may be of some use to you.

==How question work==
===Matching===
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will use all regular expressions in order to find the first full match (full for the expression, but not necessarily the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]) then the partial match that is the shortest to complete will be chosen (for displaying a hint, zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

Basically, this question type is an extended version of the Shortanswer.

===Notations===
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The exciting part of notations is that you can use the Preg question type just as improved shortanswer, having access to the hinting without any need to understand regular expressions!
* The '''Regular expression''' which means Perl-compatible regex dialect is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so that you may freely format your regexes with spaces. It will also ignore line breaks with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to the hinting. You can skip all that is said on the regular expression topic here, but be sure to read the [[#Hinting|hinting]] section to understand various settings you can alter to configure your question's hinting behaviour.
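
To get a feeling for the extended notation, here is a sketch that uses Python's <code>re.VERBOSE</code> flag, which follows similar rules to the 'x' modifier described above (unescaped whitespace outside character classes is ignored, "#" starts a comment). It is an analogy under that assumption, not the question's own parser; the names <code>DECIMAL_EXTENDED</code> and <code>DECIMAL_COMPACT</code> are hypothetical.

```python
import re

# Assumption: Python's re.VERBOSE flag is used as an analogue of the
# extended notation; it is not the Preg question's own parser.
DECIMAL_EXTENDED = re.compile(
    r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
    """,
    re.VERBOSE,
)

# The same regex written in the ordinary, compact notation.
DECIMAL_COMPACT = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

for text in ["-12.5", ".75", "12"]:
    extended = DECIMAL_EXTENDED.fullmatch(text)
    compact = DECIMAL_COMPACT.fullmatch(text)
    # Both notations should agree on every string.
    print(text, extended is not None, compact is not None)
```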

===Hinting===
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and breaks off at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the one next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

====Next lexem hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

====General hinting rules====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility whether he wants to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified it is too easy to repeatedly press '''hint''', which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0,5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student response.
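
The selection rule can be written down as a tiny sketch. This is not the question type's actual code: the <code>Answer</code> type and the function <code>answers_used_for_hinting</code> are hypothetical names, and grades are represented as fractions from 0 to 1.

```python
from typing import List, NamedTuple

class Answer(NamedTuple):
    """Hypothetical model of one answer: its regex and its grade fraction."""
    regex: str
    grade: float  # 1.0 means 100%

def answers_used_for_hinting(answers: List[Answer], border: float) -> List[Answer]:
    """Keep only the answers whose grade reaches the hint grade border."""
    return [answer for answer in answers if answer.grade >= border]

ANSWERS = [
    Answer(r"are blue, white(,| and) red", 1.0),
    Answer(r"are white(,| and) red", 0.6),
]

# With the border at 1 only the 100% answer can produce hints.
print([a.regex for a in answers_used_for_hinting(ANSWERS, 1.0)])
# With the border at 0.5 the 60% answer is used for hinting as well.
print([a.regex for a in answers_used_for_hinting(ANSWERS, 0.5)])
```

Answers filtered out this way would still be checked for a full match, as described above; only partial matching and hinting skip them.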

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''' and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
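
The placeholder substitution described in this section can be imitated with the sketch below. It is an illustration only, written with Python's <code>re</code> module as a stand-in for the question's engines; the function <code>fill_feedback</code> is a hypothetical name and not part of the Preg question type.

```python
import re

# Assumption: this imitates the {$n} placeholders with Python's re module;
# it is not the Preg question type's own implementation.

def fill_feedback(regex, response, feedback):
    """Replace {$0}, {$1}, ... in feedback with captured subpatterns."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        # Unmatched or nonexistent subpatterns become an empty string.
        if index > match.re.groups:
            return ""
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

DECIMAL = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(DECIMAL, "123.34", FEEDBACK))
# No integral part: {$1} is replaced by an empty string.
print(fill_feedback(DECIMAL, ".5", FEEDBACK))
```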

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary", where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
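
The missing-word regex from this section can be tried out as follows. A sketch, assuming Python's <code>re</code> module as a stand-in for the PHP preg engine (negative lookahead works the same way); the helper <code>is_missing</code> is a hypothetical name.

```python
import re

# Assumption: Python's re module stands in for the PHP preg engine.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def is_missing(response):
    """Return True if the word "necessary" does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None

print(is_missing("this step is optional"))
print(is_missing("this step is necessary"))
# "\b" requires a whole word, so "unnecessary" does not count as present.
print(is_missing("this step is unnecessary"))
```

The match itself is empty (the regex consists of assertions only), so such an answer is useful for its feedback rather than for the matched text.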

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind part uses a single "\s" because PCRE requires lookbehind assertions to have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
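
The two misplaced-word checks can be tried with the following sketch. It uses Python's <code>re</code> module, writes the lookbehind in the standard <code>(?<!...)</code> form and with a single <code>\s</code> (Python, like PCRE, needs fixed-width lookbehind), and the constant names are made up for the example.

```python
import re

# "am" that is neither preceded by "I " nor followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by " am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

for response in ["I am victor", "am I victor", "victor am I"]:
    print(response,
          "| misplaced am:", MISPLACED_AM.search(response) is not None,
          "| misplaced I:", MISPLACED_I.search(response) is not None)
```

Only the correctly ordered sentence leaves both checks silent, which is why this approach is described as rough.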

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

The NFA and DFA matchers don't support the complex assertions used by these regexes for now. Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that exist only to give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not really a restriction.

===Matching engines===
A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions it should handle. The engines have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

====Deterministic finite state automata (DFA)====
WARNING: This engine has not been maintained over the past year. Use the NFA engine instead if you can (i.e. if your regexes don't get rejected): it allows almost anything the DFA engine does, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP regex operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

====Non-deterministic finite state automata (NFA)====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of a regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
Here you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree graphically shows this inner structure of the expression: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown as the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case the line colour changes). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid-line rectangles show subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it into the feedback or use it in a backreference in the expression. If you do not need to remember a part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]
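
The difference between ( ) and (?: ) in the regex from the picture can also be checked in code. This is a small sketch in Python's <code>re</code> module, used here only as a stand-in for the Preg engines.

```python
import re

# (?: ) only groups; (abc) is the single remembered subexpression.
pattern = re.compile(r"(?:(abc)|de)f")

print(pattern.groups)                       # 1: only "(abc)" is remembered
print(pattern.fullmatch("abcf").group(1))   # abc
print(pattern.fullmatch("def").group(1))    # None: "(abc)" took no part in the match
```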

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in the database when saving the question, if you save the regex (they will be lost if you close the window with the "cancel" button).

==Understanding regular expressions==

===Understanding expressions in general===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.
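
If you like to see such things for yourself, the arithmetic examples above can be inspected with a parser. The sketch below uses Python's standard <code>ast</code> module purely as an illustration (it has nothing to do with the Preg parser); the helper name <code>top_operator</code> is made up.

```python
import ast

def top_operator(expression):
    """Return the operator at the root of the tree, i.e. the one evaluated last."""
    root = ast.parse(expression, mode="eval").body
    return type(root).__name__, type(root.op).__name__

print(top_operator("x+y*2"))    # ('BinOp', 'Add'): '*' binds tighter, '+' is applied last
print(top_operator("(x+y)*2"))  # ('BinOp', 'Mult'): parentheses changed the order
print(top_operator("-x"))       # ('UnaryOp', 'USub'): the unary minus has one operand
```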

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands, i.e. preceded by a backslash.

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

TODO - write more about operands and operators, but simple.

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's see what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
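
The precedence rules above can be verified with a few exact-match checks. This is a minimal sketch with Python's <code>re</code> module standing in for the Preg engines; the helper name <code>matches</code> is made up.

```python
import re

def matches(regex, text):
    """True when the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
print(matches("ab*", "abbb"), matches("ab*", "abab"), matches("(ab)*", "abab"))
# Concatenation over alternative, and quantifier over alternative.
print(matches("ab|cd*", "cddd"), matches("ab|cd*", "abab"), matches("(ab|cd)*", "ababcdabcdcd"))
# A quantifier repeats the whole alternative, not one fixed choice from it.
print(matches("(a|b){2}", "ab"), matches("a{2}|b{2}", "ab"))
```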

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative; this makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
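
The anchoring pitfall is easy to reproduce. The following sketch uses Python's <code>re.search</code> (a match anywhere in the string, as with exact matching turned off); the helper name <code>found</code> is made up for the example.

```python
import re

def found(regex, response):
    """True when the regex matches somewhere in the response."""
    return re.search(regex, response) is not None

print(found("^ab|cd$", "abxyz"))    # True: "^ab" alone is one of the alternatives
print(found("^ab|cd$", "xyzcd"))    # True: "cd$" alone is the other alternative
print(found("^(ab|cd)$", "abxyz"))  # False: both anchors apply to each alternative
print(found("^ab$|^cd$", "cd"))     # True: every top-level alternative is anchored
print(found("ab|cd", "xxabxx"))     # True: without anchors the match may be anywhere
```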

==Regular expressions reference==

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#* full list of characters that need escaping: '''\ ^ $ . [ ] | ( ) ? * + { }'''
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands (have those meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
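
Several of the operands above can be tried with the sketch below. It uses Python's <code>re</code> module, which is an assumption made only for illustration: Python has no "\Q ... \E", Unicode property or POSIX class syntax, so <code>re.escape</code> is used as the nearest equivalent of "\Q ... \E" and the other two are left out.

```python
import re

def whole(regex, text):
    """True when the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None

# Escaped special characters match themselves.
print(whole(r"a\*b\[", "a*b["))
# Ranges "a-s", "C-F", "0-9" plus the single character "z".
print([ch for ch in "atzDG5" if whole("[a-szC-F0-9]", ch)])
# Negative character set: anything except "a".."z" and "-".
print(whole("[^a-z-]", "A"), whole("[^a-z-]", "-"))
# re.escape() plays the role of \Q ... \E: everything inside is literal.
print(whole(re.escape("^(abc)$") + ".", "^(abc)$!"))
# \b is a simple assertion: it consumes nothing, it only checks the position.
print(re.search(r"\bcat\b", "a cat") is not None,
      re.search(r"\bcat\b", "concatenate") is not None)
```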

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
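
To see what each quantifier accepts, one can list the numbers of repetitions of "x" that match. This is a sketch in Python's <code>re</code> module (an assumption for illustration, not the Preg engines); the helper name <code>accepted_lengths</code> is made up, and "{,2}" is left out because its handling differs between regex implementations.

```python
import re

def accepted_lengths(regex, limit=5):
    """Return the repetition counts n (0..limit) for which "x" * n matches the regex."""
    return [n for n in range(limit + 1) if re.fullmatch(regex, "x" * n)]

for regex in ["x*", "x+", "x?", "x{2,4}", "x{2,}", "x{2}"]:
    print(regex, accepted_lengths(regex))
```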

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by their opening parentheses, with number 0 reserved for the whole regex. So that "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
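
Both backreference examples can be checked with the following sketch, which uses Python's <code>re</code> module as a stand-in for the Preg engines (the numbered "\1" syntax is the same there).

```python
import re

# The backreference must repeat exactly what the subpattern captured.
doubled = re.compile(r"([ab])\1")
print([s for s in ["aa", "ab", "ba", "bb"] if doubled.fullmatch(s)])

# The C declaration regex with optional spaces, copied from the text above.
declaration = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")
match = declaration.fullmatch(" int var2 ; var2=123 ; ")
print(match.group(1))                                  # the captured variable name
print(declaration.fullmatch("int a; b = 1;") is None)  # different names: no match
```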

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
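
Named subpatterns behave the same way as numbered ones, as this sketch shows. It uses Python's <code>re</code> module, which (an important difference from PCRE) accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above; the name <code>letter</code> is made up for the example.

```python
import re

# Named version of "([ab])\1".
doubled = re.compile(r"(?P<letter>[ab])(?P=letter)")

match = doubled.fullmatch("bb")
print(match.group("letter"))  # the capture is available by name...
print(match.group(1))         # ...and still by number
print(doubled.fullmatch("ab") is None)
```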

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

===Complex assertions===
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
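
What exactly ends up in the match is easiest to see by running the four assertions. The sketch below uses Python's <code>re</code> module, whose lookaround syntax is the same; the helper name <code>first_match</code> is made up.

```python
import re

def first_match(regex, text):
    """Return the text of the first match, or None when there is no match."""
    found = re.search(regex, text)
    return found.group() if found else None

print(first_match(r"a+(?=b)", "aaab"))  # aaa: "b" is checked but not included
print(first_match(r"a+(?!b)", "aaab"))  # aa: the last "a" is followed by "b"
print(first_match(r"(?<=b)a+", "baa"))  # aa: "b" before the match is not included
print(first_match(r"(?<!b)a+", "baa"))  # a: the first "a" is preceded by "b"
```

The second and fourth results show that the engine backtracks or moves on until the assertion holds, so a negative assertion can shorten the match rather than reject the string.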

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and add a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
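
The scope of a local modifier can be tried with the sketch below. It relies on an assumption worth stating: Python's <code>re</code> module does not accept "(?i)" in the middle of a pattern, so the example uses the scoped form "(?i:...)" to mark the same part that is shown in bold above.

```python
import re

def whole(regex, text):
    """True when the whole text matches the (case-sensitive by default) regex."""
    return re.fullmatch(regex, text) is not None

# Counterpart of "abc(de(?i)gh)xyz": only "gh" ignores case.
inside = r"abc(de(?i:gh))xyz"
print(whole(inside, "abcdeGHxyz"), whole(inside, "abcDEghxyz"), whole(inside, "abcdeghXYZ"))

# Counterpart of "abc(de)(?i)ghxyz": everything after "(de)" ignores case.
top_level = r"abc(de)(?i:ghxyz)"
print(whole(top_level, "abcdeGhXyZ"), whole(top_level, "ABCdeghxyz"))
```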

===Error reporting===
Native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
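
For comparison, here is how another regex implementation reports some of the same mistakes. The sketch uses Python's <code>re</code> module; its messages and error classes are Python's own and are not the ones produced by the Preg parser. The helper name <code>describe_error</code> is made up.

```python
import re

# A few broken regexes from the list above ("a{5,3}" gives the range an operand).
BROKEN = ["abc)", "(?:qwerty", "+", "[a-fA-F\\d", "[z-a]", "a{5,3}", "(abc)\\2"]

def describe_error(regex):
    """Return the parser's error message, or None when the regex compiles."""
    try:
        re.compile(regex)
    except re.error as error:
        return error.msg
    return None

for regex in BROKEN:
    print(repr(regex), "->", describe_error(regex))
```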

==The ways to give back==
This software is considered a scientific project and as such needs the kind of help any scientific project can use:
* evidence that the results of our work (i.e. the Preg question type) are really useful to people and have been used in a production environment;
* cooperative work to research its effectiveness for various applications - basically you need to write about how you used this question type and run a survey of your teachers and/or students about it - but it can include co-authoring a conference thesis or journal article;
* cooperation in writing an article or help in publishing it in English-language journals (information about and help with grants for further work are welcome too).

If you are considering any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, while doing cooperative research I may give you tips on how to improve your regexes, etc).

I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper describing your usage of the Preg question that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you are considering this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be able to better determine what should be done next;
* join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard for the developers to do all the testing themselves.

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexeme hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-08-22T19:56:42Z

Oasychev: Wrote a draft of section about docs usage, change level of headers

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and a part about the Preg question type itself. If you are familiar with regex syntax you may skip the first two parts and go to [[#Usage of the Preg question type|usage of the Preg question type]]. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regular expression manuals, so I'm not going to repeat them here.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

==Ways to use Preg questions and these docs==

===I don't (want to) know anything about regular expressions, but next word (character) hinting seems useful===
You can use the Preg question type just like Shortanswer questions with advanced hinting, without any knowledge of regular expressions. To do this, you need to choose
* '''Notation''' => Moodle shortanswer
* '''Engine''' => Non-deterministic finite state automata

After that, you can just copy the answers from your shortanswer questions. You may want to read the section about question working to understand more about the hinting settings.

===I have a vague knowledge of regexes, but want to use pattern matching===
If creating regular expressions is a hard task for you, but you want to use their strength as patterns, you may make heavy use of the authoring tools to create your questions. The authoring tools show you the meaning of your expression in different ways: the internal structure of the expression (syntax tree), a visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see whether it works as expected. Experiment and play with your regexes, watch the corresponding changes in the authoring tools, and eventually you may get the regex you want.

Read the section on authoring tools, then (maybe after some experimenting) the start of the section about understanding regular expressions. You should also read the section about question working to better understand the various settings and how they affect your questions.

===I could spend some effort to learn regular expressions well and be able to do anything they can===
If you don't know regular expressions well, but want to understand them really well and create complex regexes easily; if you want to know what you are doing when writing your regexes instead of blindly trying, you should spend some time and effort on understanding them. Do not worry - it's not as hard as it sounds.

If you want to do that, read the section about understanding regular expressions. Then read a little about the authoring tools and use them to experiment with creating regexes, to see whether you really understand them and whether they behave as expected. The syntax tree may be of special use to you when you try to get the meaning of ''precedence'' and ''arity'' right. Once you understand the principles of regular expressions well, read the sections about question working and the regular expression reference (to know your possibilities). Now you should be able to write your regexes without much use of the authoring tools, except for the testing tool to test your regexes.

===I know regular expressions well enough to write them on my own without further guidance===
You should read about question working to understand the various settings and the question's behaviour under them. You may also be interested in regex testing in the authoring tools section. Finally, the regular expression reference may be of some use to you.

==Understanding expressions==
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all been mastering arithmetic expressions since childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each (arity), minding their evaluation order through '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.
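
The structure of such an expression can also be inspected mechanically. The following is a minimal sketch that uses Python's standard ast module (Python is used here only for illustration and is not part of the question type; the helper name top_operator is made up). It reports which operator sits at the top of the expression tree, i.e. the one that is applied last.

```python
import ast

def top_operator(expression):
    """Return the name of the operator that is applied last."""
    node = ast.parse(expression, mode="eval").body
    return type(node.op).__name__

print(top_operator("x+y*2"))    # Add: '*' has precedence, so '+' is applied last
print(top_operator("(x+y)*2"))  # Mult: parentheses make '+' be evaluated first
print(top_operator("-x"))       # USub: the unary minus takes a single operand
```

The operator reported for each expression is the root of its tree, which is the same idea the syntax tree of a regular expression is built on.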

==Regular expressions==
Regular expressions are a powerful mechanism for searching in strings using patterns, so their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' (preceded by a backslash) when used as operands. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

===Operands===
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter one anyway); escape it as "\." if you need to match a literal dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be placed first;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters; they can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; unlike other operands they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between a word (\w) and a non-word (\W) character, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches anywhere that is not a word boundary, the negation of "\b".
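
A few of the operands above can be tried out directly. This is a minimal sketch in Python's re module, used here only as a convenient stand-in: the question type itself works with PCRE syntax, and Python does not support \Q...\E, "\p{xx}" or POSIX classes. re.escape() is Python's own helper that plays a role similar to \Q...\E.

```python
import re

# Escaped special characters match themselves literally.
assert re.fullmatch(r"a\*b\[", "a*b[")
# re.escape() adds the backslashes for you; the trailing "." is still the dot meta-character.
assert re.fullmatch(re.escape("^(abc)$") + ".", "^(abc)$!")

# Character class with the ranges a-s, C-F, 0-9 and the single character z.
char_class = re.compile(r"[a-szC-F0-9]")
assert char_class.fullmatch("z") and char_class.fullmatch("D")
assert not char_class.fullmatch("t")

# Negative character class: anything except a-z and the hyphen.
assert re.fullmatch(r"[^a-z-]", "A") and not re.fullmatch(r"[^a-z-]", "-")

# \b is an assertion: it checks a position and consumes no characters.
words = re.findall(r"\bcat\b", "cat concat cat.")
print(words)  # the "cat" inside "concat" is not on a word boundary
```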

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

===Operators===
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
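
The behaviour of these operators is easy to check by experiment. A minimal sketch using Python's re module as a stand-in for the question's engines (the helper name matches is made up for this example):

```python
import re

def matches(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# An empty alternative makes the parenthesized part optional.
optional = [s for s in ["aac", "bbc", "c", "abc"] if matches(r"(aa|bb|)c", s)]
print(optional)

# Range quantifier: from 2 to 4 repetitions.
bounded = [n for n in range(1, 6) if matches(r"x{2,4}", "x" * n)]
print(bounded)

# Open range: two or more repetitions.
unbounded = [n for n in range(1, 6) if matches(r"x{2,}", "x" * n)]
print(unbounded)
```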

===Precedence and order of evaluation===
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's see what this means:
# ''quantifiers over concatenation'' means that quantifiers are applied first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - adding parentheses to the previous regex allows us to define the repetition of a whole string;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
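
The precedence rules above can be verified with a few assertions. This is a sketch in Python's re module, whose precedence rules for these operators are the same; the helper name full is made up.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifier binds tighter than concatenation: only "b" is repeated.
assert full(r"ab*", "abbb") and not full(r"ab*", "abab")
# Parentheses make the quantifier apply to the whole "ab".
assert full(r"(ab)*", "abab") and not full(r"(ab)*", "abbb")

# Quantifier binds tighter than alternative: only "d" is repeated.
assert full(r"ab|cd*", "cddd") and not full(r"ab|cd*", "abab")
assert full(r"(ab|cd)*", "ababcdabcdcd")

# Repeating an alternative is not the same as alternating repetitions.
assert full(r"(a|b){2}", "ab") and not full(r"a{2}|b{2}", "ab")
```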

===Subpatterns and backreferences===
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered starting from 0 for the whole regex and counted by their opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is the 1st, "c" which is the 2nd and "d" which is the 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
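
The declaration regex can be tried out as follows. A minimal sketch in Python's re module, which handles numbered backreferences the same way; the names DECLARATION and check are made up for this example.

```python
import re

DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

def check(response):
    """Return the captured variable name, or None if the response doesn't match."""
    match = DECLARATION.fullmatch(response)
    return match.group(1) if match else None

print(check(" int var2 ; var2=123 ; "))  # var2
print(check("int a; b = 1;"))              # None: \1 must repeat the captured name
```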

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
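
The benefit of names can be seen in a small experiment. This sketch uses Python's re module, which supports only the "(?P<name>...)" and "(?P=name)" forms from the list above; the pattern itself is a made-up example.

```python
import re

# (?P<quote>...) captures the opening quote, (?P=quote) requires the same one to close.
QUOTED = re.compile(r"(?P<quote>['\"])(?P<body>[^'\"]*)(?P=quote)")
match = QUOTED.fullmatch("'hello'")
print(match.group("body"))           # hello
print(QUOTED.fullmatch("'hello\""))  # None: the closing quote differs

# Adding a new subpattern in front shifts the numbers but not the names.
SHIFTED = re.compile(r"(\s*)" + QUOTED.pattern)
shifted = SHIFTED.fullmatch("  'hello'")
print(match.group(2), "/", shifted.group(2))  # number 2 now points to the quote
print(shifted.group("body"))                  # the name still gives hello
```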

====Duplicate subpattern numbers and names====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both 1st ones, "b" and "d" are both 2nd ones.

===Complex assertions===
Assertions about some part of the string don't actually become part of the matched text, but affect where a match can occur:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
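
The effect of each assertion on the matched text can be observed directly. A minimal sketch in Python's re module, which uses the same syntax for these four assertions; the helper name found is made up.

```python
import re

def found(pattern, text):
    """Return the first matched substring, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(found(r"a+(?=b)", "aaab"))   # aaa: "b" is required but not included
print(found(r"a+(?!b)", "aaab"))   # aa: the last "a" is followed by "b" and is given up
print(found(r"(?<=b)a+", "baa"))   # aa
print(found(r"(?<!b)a+", "baa"))   # a: the first "a" is preceded by "b" and is skipped
```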

===Matching===
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing delimiters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one regex with a 100% grade) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and take the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one is the shortest to complete.
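
The grading rule described above (try the answers in order, take the grade of the first full match) can be sketched as follows. This is only an illustration in Python under stated assumptions, not the plugin's actual code: the answer list, the grades and the function name grade are hypothetical, and partial matching is left out.

```python
import re

# Hypothetical answers: (regular expression, grade as a fraction), tried in order.
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def grade(response, answers, case_insensitive=False):
    """Return the grade of the first answer that matches, else 0."""
    # Python str patterns are Unicode-aware by default; IGNORECASE plays the role of "i".
    flags = re.IGNORECASE if case_insensitive else 0
    for pattern, fraction in answers:
        if re.search(pattern, response, flags):
            return fraction
    return 0.0

print(grade("they are blue, white and red", ANSWERS))
print(grade("are white, red", ANSWERS))
print(grade("ARE WHITE, RED", ANSWERS, case_insensitive=True))
```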

===Anchoring===
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
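
The interplay of anchors and alternatives can be checked with a few assertions. A sketch in Python's re module; the helper exact only imitates the visible effect of the exact matching option (wrapping in a non-capturing group is an assumption made here so that subpattern numbers stay unchanged, not the plugin's actual code).

```python
import re

def hit(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# "^ab|cd$" reads as "(^ab)|(cd$)".
assert hit(r"^ab|cd$", "abxxx") and hit(r"^ab|cd$", "xxxcd")
# Anchors around a group force the whole response to be "ab" or "cd".
assert hit(r"^(ab|cd)$", "cd") and not hit(r"^(ab|cd)$", "abxxx")
assert hit(r"^ab$|^cd$", "ab") and not hit(r"^ab$|^cd$", "xxxcd")

def exact(pattern):
    """Anchor the whole expression (hypothetical imitation of exact matching)."""
    return r"^(?:" + pattern + r")$"

assert hit(exact(r"ab|cd"), "cd") and not hit(exact(r"ab|cd"), "abxxx")
```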

===Local case-sensitivity modifiers===
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

==Usage of the Preg question type==
Basically, this question type is an extended version of the Shortanswer.

===Notations===
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The exciting part of notations is that you can use the Preg question type just as improved shortanswer, having access to the hinting without any need to understand regular expressions!
* The '''Regular expression''' which means Perl-compatible regex dialect is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from a # character (unescaped and not part of a character class) to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
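
To give an impression of what a regex in the extended notation can look like, here is a sketch using Python's re.VERBOSE flag, which follows very similar rules (unescaped whitespace is ignored outside character classes, # starts a comment). It is an analogy only; the details of the Preg notation may differ, and the patterns are made-up examples.

```python
import re

NUMBER = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point (escaped dot)
    ([0-9]+)      # fractional part
""", re.VERBOSE)

# A literal space needs a character class and a literal hash sign needs a backslash.
TAG = re.compile(r"issue [ ] \# \d+   # e.g. issue followed by a space, a hash and digits",
                 re.VERBOSE)

print(NUMBER.fullmatch("-12.5").groups())
print(TAG.fullmatch("issue #42") is not None)
```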

===Hinting===
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and at which character the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

====Next character hinting====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the one next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

====Next lexem hinting====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language they are a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

The Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - the programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

====General hinting rules====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), but shows them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for the students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
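
The selection rule can be written down in a few lines. This is a sketch of the rule as described above, in Python, not the plugin's code; the answer list (including the 30% answer) and the function name are hypothetical.

```python
# Hypothetical answers: (regular expression, grade as a fraction).
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
    (r"are red", 0.3),
]

def answers_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade is at least the hint grade border."""
    return [pattern for pattern, fraction in answers if fraction >= hint_grade_border]

print(answers_for_hinting(ANSWERS, 1.0))  # only the 100% answer
print(answers_for_hinting(ANSWERS, 0.5))  # the 100% and the 60% answers
```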

===Subpattern capturing and feedback===
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last of all its repeated matches will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to remove "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
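
The placeholder substitution described above can be imitated like this. A minimal sketch in Python, not the plugin's implementation; the function name fill_feedback is made up, and leaving placeholders with unknown numbers untouched is an assumption of this example.

```python
import re

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

def fill_feedback(template, match):
    """Replace each {$n} placeholder by the text captured by subpattern n."""
    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return placeholder.group(0)   # assumption: unknown numbers stay as they are
        return match.group(number) or ""  # a subpattern that took no part gives ""
    return re.sub(r"\{\$(\d+)\}", substitute, template)

print(fill_feedback(FEEDBACK, NUMBER.search("123.34")))
print(fill_feedback(FEEDBACK, NUMBER.search(".5")))
```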

===Error reporting===
The native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a non-existent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
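
Several of these error classes exist in other regex implementations too, which makes it easy to see what such reporting looks like. The sketch below uses Python's re module purely as an analogy: its messages are Python's own, not those of the Preg parser, and the helper name describe_error is made up.

```python
import re

def describe_error(pattern):
    """Return None for a valid pattern, otherwise the parser's error message."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None

# Examples taken from the list above; all of them are rejected by Python as well.
BROKEN = ["abc)", "(?:qwerty", "+", "a(+)", "[a-fA-F\\d", "[z-a]", "x{5,3}", "ab\\", "(abc)\\2"]
for pattern in BROKEN:
    print(repr(pattern), "->", describe_error(pattern))
```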

===Looking for missing and misplaced things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature that lets you define an answer that will match when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regex for the missing thing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
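
The missing-word expression can be tried on a few responses. A minimal sketch in Python's re module, which treats this lookahead the same way; the sample responses and the function name are made up.

```python
import re

MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True if the response does not contain the whole word 'necessary'."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("it is necessary to escape"))  # False
print(word_is_missing("it is needed to escape"))     # True
print(word_is_missing("that is unnecessary"))        # True: \b requires a whole word
```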

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the negative lookbehind "(?<!I\s)") and doesn't have "victor" after it (the negative lookahead "(?!\s+victor)"). "\s+" in the lookahead allows any number of spaces between the words; the lookbehind uses a single "\s" because lookbehind assertions in PCRE generally have to be of fixed length. If you want to catch the first (last) word (punctuation mark, etc) - then you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

The NFA and DFA matchers don't support the complex assertions used by these regexes for now. Since the Preg 2.3 release you can combine hints and catching missing words, but you should make sure that the answers that look for missing things (and others meant to give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't describe a correct situation, so this is no real restriction.

===Matching engines===
A matching engine is a piece of program code that performs the matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine has to handle. The engines differ in stability and in the features they offer.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of the Perl-compatible regular expression features and is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we persuade the PHP developers to add it) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.

====Deterministic finite state automata (DFA)====
WARNING: This engine has not been maintained for the past year. Use the NFA engine instead if you can (i.e. if your regexes don't get rejected); it allows almost anything the DFA engine can do, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP preg operands and operators are supported, and for some (more exotic) ones the behaviour can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

====Non-deterministic finite state automata (NFA)====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of the regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from beginning to end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember a part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]
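The difference between ( ) and (?: ) can also be checked outside Moodle. The sketch below uses Python's re module purely as a stand-in for a Perl-compatible engine (an assumption: Preg itself runs on PHP and its own matchers) and applies the regex from the picture to two strings.

```python
import re

# Regex from the picture: the outer group is non-capturing, the inner one captures.
pattern = re.compile(r"(?:(abc)|de)f")

first = pattern.fullmatch("abcf")
second = pattern.fullmatch("def")

# Only (abc) gets a number; (?: ) groups the alternatives without remembering anything.
captured_abc = first.group(1)   # text remembered by subexpression 1
captured_de = second.group(1)   # subexpression 1 took no part in this match
group_count = pattern.groups    # number of capturing subexpressions
```

Only one subexpression is counted, and it stays empty when the "de" alternative is the one that matched.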

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how your expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

==The ways to give back==
I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could refer to, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Improve a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Plik:qtype preg authortools4.png

2013-08-22T14:34:52Z

Oasychev: Oasychev uploaded a new version of "File:qtype preg authortools4.png": Deleted wide unused area on the left.

Preg question type

2013-08-22T14:31:14Z

Oasychev: /* Explaining graph */ - fixing image link

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and a part about Preg question type itself. If you are familiar with regex syntax you may skip the first two parts and go to [[#Usage of the Preg question type|usage of the Preg question type]]. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you arrange a sequence of '''operators''' with the correct number of '''operands''' for each (arity), controlling their evaluation order through their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
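A few of the operands above can be tried in a short script. This is only a sketch: it uses Python's re module as a stand-in for a Perl-compatible engine (an assumption, since Preg runs on PHP), and it deliberately avoids "\Q...\E", unicode properties and POSIX classes, which Python's re does not accept. The helper name matches_fully is made up for the illustration.

```python
import re

def matches_fully(regex, text):
    """True if the whole text matches the regex (hypothetical helper)."""
    return re.fullmatch(regex, text) is not None

# Escaped special characters match themselves.
escaped_ok = matches_fully(r"a\*b\[", "a*b[")

# Ranges and single characters can be mixed inside a character class.
in_class = [c for c in "asztCG5" if matches_fully(r"[a-szC-F0-9]", c)]

# A negative character set: anything except "a".."z" and "-".
negated = [c for c in "q-Q7" if matches_fully(r"[^a-z-]", c)]

# \b is a simple assertion: it tests a position and consumes nothing.
whole_word = re.search(r"\bcat\b", "a cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None
```

Here "t" and "G" fall outside the ranges of the first class, and "cat" inside "concatenate" is rejected because there is no word boundary around it.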

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
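To see exactly which strings an operator accepts, it can help to enumerate them by brute force. The sketch below does that for three regexes from the list, using Python's re module as a stand-in engine (an assumption); the accepted helper and its alphabet and length limits are invented for the illustration.

```python
import re
from itertools import product

def accepted(regex, alphabet, max_len):
    """List every string over `alphabet` up to `max_len` that fully matches."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return [w for w in words if re.fullmatch(regex, w)]

# A repeated alternative: each repetition chooses "a" or "b" independently.
repeated_alternative = accepted(r"(a|b){2}", "ab", 3)

# An empty alternative makes the prefix optional.
empty_alternative = accepted(r"(aa|bb|)c", "abc", 3)

# A bounded quantifier.
bounded_repeat = accepted(r"x{2,4}", "x", 5)
```

The first result contains all four two-letter combinations, which is the point of the last list item: the quantifier repeats the choice, not the chosen character.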

====Precedence and order of evaluation====
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more time in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
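The precedence rules can be verified by comparing each regex with its parenthesised variant on the same string. A minimal sketch, again with Python's re module standing in for the Preg engines (an assumption); full is a hypothetical helper.

```python
import re

def full(regex, text):
    """True if the whole text matches the regex (hypothetical helper)."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
star_on_char = (full(r"ab*", "abbb"), full(r"ab*", "abab"))
star_on_group = full(r"(ab)*", "abab")

# Quantifier over alternative: "*" belongs to "d" only, not to the whole set.
alt_without_parens = (full(r"ab|cd*", "cdddddd"), full(r"ab|cd*", "abab"))
alt_with_parens = full(r"(ab|cd)*", "ababcdabcdcd")

# Repeating the alternative versus choosing between repetitions.
mixed_pair = (full(r"(a|b){2}", "ab"), full(r"a{2}|b{2}", "ab"))
```

Each pair shows a string that one form accepts and the other rejects.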

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a small example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
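Both backreference examples can be checked with a few lines of code. The sketch uses Python's re module as a stand-in for a Perl-compatible engine (an assumption); the variable names are invented for the illustration.

```python
import re

same_twice = re.compile(r"([ab])\1")
declaration = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

# "\1" must repeat exactly what subpattern 1 captured.
doubled = [s for s in ("aa", "ab", "ba", "bb") if same_twice.fullmatch(s)]

# The variable name after ";" has to be the same one that was declared.
ok = declaration.fullmatch(" int var2 ; var2=123 ; ")
mismatch = declaration.fullmatch("int var2; var3 = 123;")
variable_name = ok.group(1)
```

The second string is rejected because "var3" is not what the subpattern captured, which is exactly the check a plain character class could not express.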

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
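As an illustration, here is the C declaration regex rewritten with names. This is a sketch under assumptions: it runs on Python's re module, which accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above, and the second named subpattern "value" is added here just for the example.

```python
import re

# "(?P<var>...)" names the subpattern, "(?P=var)" refers back to it.
declaration = re.compile(
    r"\s*int\s+(?P<var>[_\w][_\w\d]*)\s*;\s*(?P=var)\s*=\s*(?P<value>-?\d+)\s*;\s*"
)

match = declaration.fullmatch("int _var; _var = -10;")

# Named subpatterns still have numbers, so both ways of access work.
by_name = (match.group("var"), match.group("value"))
by_number = (match.group(1), match.group(2))
```

If another subpattern were inserted before "var", the numbers would shift while the names would keep working.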

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches any number of "a" ending with "b" without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches any number of "a" that is not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches any number of "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches any number of "a" that is not preceded by "b".
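The four assertions are easiest to understand by looking at what ends up in the match. A minimal sketch, using Python's re module as a stand-in engine (an assumption); found is a hypothetical helper that returns the matched text.

```python
import re

def found(regex, text):
    """Return the matched part of the text, or None (hypothetical helper)."""
    m = re.search(regex, text)
    return m.group(0) if m else None

ahead_pos = found(r"a+(?=b)", "aaab")    # "b" is required but not included
ahead_neg = found(r"a+(?!b)", "aaab")    # the engine gives back one "a"
behind_pos = found(r"(?<=b)a+", "baaa")  # "b" is required but not included
behind_neg = found(r"(?<!b)a+", "baaa")  # the match starts one "a" later
```

Note the two negative cases: the engine does not fail, it shortens or shifts the match until the assertion holds, which can be surprising when such an answer is used for grading.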

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.
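The "first full match wins" rule can be sketched in a few lines. This is not the question type's actual code: the function grade_response, the answer list and the grades are assumptions made for the illustration, Python's re module stands in for the matching engine, and partial matching is left out.

```python
import re

# Hypothetical answers in the order they were entered: (regex, grade).
ANSWERS = [
    (r"^are blue, white(,| and) red$", 1.0),
    (r"are white(,| and) red", 0.6),
]

def grade_response(response, answers, case_sensitive=True):
    """Return the grade of the first regex that has a full match."""
    # Python strings are already unicode, so only "i" needs a flag here.
    flags = 0 if case_sensitive else re.IGNORECASE
    for regex, grade in answers:
        if re.search(regex, response, flags):
            return grade
    return 0.0

exact = grade_response("are blue, white and red", ANSWERS)
forgot_blue = grade_response("they are white, red", ANSWERS)
wrong_case = grade_response("ARE BLUE, WHITE, RED", ANSWERS)
ignoring_case = grade_response("ARE BLUE, WHITE, RED", ANSWERS, case_sensitive=False)
```

The second regex is not anchored, so it matches inside a longer response, while the first one must cover the whole response.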

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
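The three anchoring variants can be compared on responses that contain extra text. A minimal sketch, with Python's re module as a stand-in engine (an assumption) and a hypothetical found helper.

```python
import re

def found(regex, text):
    """Return the matched part of the text, or None (hypothetical helper)."""
    m = re.search(regex, text)
    return m.group(0) if m else None

# "^" belongs to "ab" only and "$" to "cd" only.
loose_start = found(r"^ab|cd$", "abxyz")
loose_end = found(r"^ab|cd$", "xyzcd")

# Parentheses make both anchors apply to the whole alternative set.
strict_extra = found(r"^(ab|cd)$", "abxyz")
strict_exact = found(r"^(ab|cd)$", "cd")

# Anchoring every top-level alternative gives the same strictness.
per_alternative = found(r"^ab$|^cd$", "xyzcd")
```

The first regex accepts both responses with extra text, which is rarely what a grading answer should do.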

If you set the '''exact matching''' options to "yes" (which is the default value), the question will add ^ and $ in each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
These modifiers locally override the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to hinting without any need to understand regular expressions!
* The '''Regular expression''' which means Perl-compatible regex dialect is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
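To give a feeling for the extended notation, the sketch below writes a decimal number regex with spaces, line breaks and comments. It relies on Python's re.VERBOSE flag, which behaves much like the PHP 'x' modifier; treating it as equivalent to Preg's extended notation is an assumption.

```python
import re

# Whitespace and "#" comments are ignored outside character classes.
DECIMAL = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
""", re.VERBOSE)

# The same regex in the ordinary one-line notation.
COMPACT = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

verbose_groups = DECIMAL.fullmatch("-12.5").groups()
compact_groups = COMPACT.fullmatch("-12.5").groups()

# Spaces in the regex are formatting only; they do not match spaces in the response.
with_space = DECIMAL.fullmatch("1 2.5")
```

Both forms capture the same parts, so the extended notation changes only how the regex reads, not what it matches.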

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. With partial matching alone, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it shows a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
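The choice of the hinted character can be imitated with a toy model. This is only a sketch under loud assumptions: Python's re module has no partial matching, so the two strings accepted by "are blue, white(,| and) red" are enumerated by hand, and the function next_char_hint is invented for the illustration; the real engines work on the automaton, not on a list of strings.

```python
import os

# Assumption: the complete set of strings accepted by the regex, listed by hand.
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def next_char_hint(response, accepted=ACCEPTED):
    """Return (matched_part, hint_char) for the partial match that needs the
    fewest extra characters, or None if nothing matches even partially."""
    best = None  # (characters still missing, matched part, next character)
    for start in range(len(response)):
        tail = response[start:]
        for target in accepted:
            matched = len(os.path.commonprefix([tail, target]))
            if matched == 0 or matched == len(target):
                continue  # nothing matched here, or the target is already complete
            remaining = len(target) - matched
            if best is None or remaining < best[0]:
                best = (remaining, target[:matched], target[matched])
    return None if best is None else best[1:]

semicolon_case = next_char_hint("are blue, white; red")
typo_case = next_char_hint("they are blue, vhite and red")
```

For the response with the semicolon the model picks ",", because completing ", red" takes 5 characters while " and red" takes 8, the same 3 character difference as described above.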

It is possible that not all regular expressions will give 100% grade. Consider you added an expression for the students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
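The rule is a simple filter over the answers. In the sketch below the answer list, the third regex "are red" and the function name hinting_regexes are all assumptions made for the illustration; grades are written as fractions.

```python
# Hypothetical answers: (regular expression, grade as a fraction).
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
    ("are red", 0.3),
]

def hinting_regexes(answers, hint_grade_border):
    """Keep only the regexes allowed to take part in partial matching and hinting."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

only_full_grade = hinting_regexes(ANSWERS, 1)
half_and_up = hinting_regexes(ANSWERS, 0.5)
```

With the border at 1 the 60% answer about the forgotten ''blue'' no longer produces hints, but it still gives its feedback on a full match.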

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax might become complex and prone to errors.
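The placeholder substitution can be imitated as follows. This is a sketch, not the question type's implementation: fill_feedback is a hypothetical function, Python's re module stands in for the engine, and replacing a placeholder that refers to a non-existent subpattern with an empty string is an assumption.

```python
import re

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
PLACEHOLDER = re.compile(r"\{\$(\d+)\}")

def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    match = re.fullmatch(regex, response)
    if match is None:
        return None

    def value(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return ""  # assumption: unknown subpattern numbers become empty
        return match.group(index) or ""  # unmatched subpatterns become empty

    return PLACEHOLDER.sub(value, feedback)

with_integral = fill_feedback(NUMBER, "123.34", FEEDBACK)
without_integral = fill_feedback(NUMBER, ".5", FEEDBACK)
```

In the second case the words "Integral part is" stay in the text with nothing after them, which is the limitation mentioned above.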

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers and is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that works when something is absent from the response (and to give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp for the thing that must be missing with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
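The missing-word check above can be tried out with a minimal sketch. It uses Python's <code>re</code> module purely for illustration (the Preg question itself runs on PHP engines), and the helper name <code>word_is_missing</code> is hypothetical.

```python
import re

# Negative lookahead anchored at the start of the response: the regex
# matches only when the whole word "necessary" appears nowhere in it.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response: str) -> bool:
    """Return True when the response does not contain the word 'necessary'."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("this step is optional"))        # True
print(word_is_missing("this step is necessary here"))  # False
# "unnecessary" is a different word, so "necessary" still counts as missing.
print(word_is_missing("this step is unnecessary"))     # True
```

Without the leading ^ the second call would also report a match, because the lookahead could succeed at a position after the word.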

You can also do a rough search for '''misplaced words''' (it actually works only if everything else is correct) using syntax like this:
(?<!I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s+)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words. PCRE itself only accepts fixed-length lookbehind assertions, so if the engine rejects "\s+" inside the lookbehind, use a single "\s" there. If you want to catch the first (last) word (punctuation mark, etc.), then you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
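As a rough illustration of the misplaced-word idea, here is a sketch in Python's <code>re</code> module (an assumption made for demonstration only, not one of the engines the question type uses). Python requires fixed-width lookbehind, so a single <code>\s</code> stands in for <code>\s+</code>. The names <code>MISPLACED_AM</code>, <code>MISPLACED_I</code> and <code>has_misplaced</code> are hypothetical.

```python
import re

# "am" that has neither "I" directly before it nor "victor" after it.
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def has_misplaced(pattern: re.Pattern, response: str) -> bool:
    """Return True when the pattern finds a misplaced word in the response."""
    return pattern.search(response) is not None

print(has_misplaced(MISPLACED_AM, "I am victor"))  # False
print(has_misplaced(MISPLACED_AM, "am I victor"))  # True
print(has_misplaced(MISPLACED_I, "I am victor"))   # False
print(has_misplaced(MISPLACED_I, "am I victor"))   # True
```

The check is rough because both assertions must hold at once: in "victor I am" the word "I" is still followed by "am", so <code>MISPLACED_I</code> does not fire there.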

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

For now the NFA and DFA matchers don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints with catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so this is no real restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine: the choice depends on the features you want to use and the regular expressions the engine has to handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. if your regexes don't get rejected); it allows almost anything the DFA engine can, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of PHP regular expressions are supported, and for some (more exotic) ones the support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpattern numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of the regular expression;
# '''explaining graph''' - shows you how your expression will work in a graphical way;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show the paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.png|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there to match against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

==The ways to give back==
I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-07-26T15:53:08Z

Oasychev:

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and a part about Preg question type itself. If you are familiar with regex syntax you may skip the first two parts and go to [[#Usage of the Preg question type|usage of the Preg question type]]. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\", other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters which could be used both inside or outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands (have those meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
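The operands listed above can be checked with a short sketch. It uses Python's <code>re</code> module as a stand-in, which is an assumption: Python has no <code>\Q...\E</code>, <code>\p{xx}</code> or POSIX classes, so <code>re.escape</code> is shown as the closest Python analogue of <code>\Q...\E</code>. The helper name <code>matches</code> is hypothetical.

```python
import re

def matches(regex: str, text: str) -> bool:
    """Return True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Escaped special characters match themselves.
print(matches(r"a\*b\[", "a*b["))        # True
print(matches(r"\\", "\\"))              # True (one backslash)

# Character classes: ranges a-s, C-F, 0-9 plus the single character z.
print(matches(r"[a-szC-F0-9]", "z"))     # True
print(matches(r"[a-szC-F0-9]", "t"))     # False
print(matches(r"[^a-z-]", "-"))          # False (negative set excludes "-")
print(matches(r"[\-\]\\]", "]"))         # True

# re.escape plays the role of \Q...\E: everything is taken literally.
print(matches(re.escape("^(abc)$") + ".", "^(abc)$x"))  # True

# \b is a simple assertion: it consumes no characters.
print(re.search(r"\bcat\b", "a cat sat") is not None)    # True
print(re.search(r"\bcat\b", "concatenate") is not None)  # False
```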

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
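A few of the operator examples above, run as a sketch with Python's <code>re</code> module (assumed here only for illustration; for these particular patterns its behaviour agrees with the list). The helper name <code>matches</code> is hypothetical.

```python
import re

def matches(regex: str, text: str) -> bool:
    """Return True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Alternative, including the empty alternative.
print(matches(r"ab|cd|", ""))          # True
print(matches(r"(aa|bb|)c", "c"))      # True
print(matches(r"(aa|bb|)c", "aac"))    # True

# Quantifiers.
print(matches(r"x{2,4}", "xxx"))       # True
print(matches(r"x{2,4}", "xxxxx"))     # False
print(matches(r"x{,2}", ""))           # True
print(matches(r"(ab)*", "ababab"))     # True

# A repeated alternative picks "a" or "b" anew on every repetition.
print(matches(r"(a|b){2}", "ba"))      # True
```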

====Precedence and order of evaluation====
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
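The precedence rules above are easiest to see by comparing each regex with its parenthesised counterpart. The following sketch again assumes Python's <code>re</code> module as a convenient test bench; <code>matches</code> is a hypothetical helper.

```python
import re

def matches(regex: str, text: str) -> bool:
    """Return True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: * repeats only "b" unless parenthesised.
print(matches(r"ab*", "abbb"))     # True
print(matches(r"ab*", "abab"))     # False
print(matches(r"(ab)*", "abab"))   # True

# Quantifier over alternative: * applies to "d" only.
print(matches(r"ab|cd*", "cddd"))              # True
print(matches(r"ab|cd*", "abab"))              # False
print(matches(r"(ab|cd)*", "ababcdabcdcd"))    # True

# Repeating the alternative versus alternating two repetitions.
print(matches(r"(a|b){2}", "ab"))    # True
print(matches(r"a{2}|b{2}", "ab"))   # False
```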

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a small example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
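The backreference example above can be exercised with the following sketch. Python's <code>re</code> module is assumed for illustration; the names <code>DECLARATION</code> and <code>declared_variable</code> are hypothetical.

```python
import re

# The second regex from the text: \1 must repeat whatever name group 1 captured.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

def declared_variable(response: str):
    """Return the variable name if the response declares and then
    initialises the same variable, otherwise None."""
    m = DECLARATION.fullmatch(response)
    return m.group(1) if m else None

print(declared_variable(" int var2 ; var2=123 ; "))  # var2
print(declared_variable("int _var; _var = -10;"))    # _var
print(declared_variable("int a; b = 1;"))            # None
```

The last call fails because the backreference has to match the text "a" captured by the subpattern, not just any identifier.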

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
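A small sketch of named subpatterns follows. It assumes Python's <code>re</code> module, which accepts only the <code>(?P<name>...)</code> and <code>(?P=name)</code> forms from the list above; the other spellings are PCRE-specific. The group names <code>var</code> and <code>value</code> are made up for the example.

```python
import re

# Named version of the C declaration regex: the names survive even if
# more groups are added or removed later.
NAMED = re.compile(
    r"int\s+(?P<var>[_a-zA-Z]\w*)\s*;\s*(?P=var)\s*=\s*(?P<value>-?\d+)\s*;"
)

m = NAMED.fullmatch("int count; count = -10;")
print(m.group("var"))    # count
print(m.group("value"))  # -10
print(m.groupdict())     # {'var': 'count', 'value': '-10'}
```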

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches any number of "a" ending with "b" without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches any number of "a" that is not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches any number of "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches any number of "a" that is not preceded by "b".
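The four assertions above can be compared on short strings. This is a sketch using Python's <code>re</code> module (an assumption for demonstration); <code>first_match</code> is a hypothetical helper that returns the text of the first match.

```python
import re

def first_match(regex: str, text: str):
    """Return the text of the first match found anywhere in `text`, or None."""
    m = re.search(regex, text)
    return m.group(0) if m else None

# Positive lookahead: "b" must follow but is not part of the match.
print(first_match(r"a+(?=b)", "aaab"))   # aaa
# Negative lookahead: the engine backs off one "a" so that "b" does not follow.
print(first_match(r"a+(?!b)", "aaab"))   # aa
# Positive lookbehind: "b" must precede but is not part of the match.
print(first_match(r"(?<=b)a+", "baaa"))  # aaa
# Negative lookbehind: the match cannot start right after "b".
print(first_match(r"(?<!b)a+", "baaa"))  # aa
```

The second and fourth results show why negative assertions are usually combined with anchors or word boundaries: without them the match simply shifts to a position where the assertion holds.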

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will use all regular expressions to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" using brackets to match exactly with "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
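The anchoring pitfall above is shown in the sketch below, assuming Python's <code>re</code> module for illustration. <code>found</code> is a hypothetical helper that searches anywhere in the response, as an unanchored match would.

```python
import re

def found(regex: str, response: str):
    """Return the matched text if the regex matches anywhere, else None."""
    m = re.search(regex, response)
    return m.group(0) if m else None

# Unanchored: the match may sit anywhere in the response.
print(found(r"ab|cd", "xxcdxx"))      # cd

# ^ binds only to "ab" and $ only to "cd".
print(found(r"^ab|cd$", "abxyz"))     # ab
print(found(r"^ab|cd$", "xyzcd"))     # cd

# Parentheses (or anchoring every alternative) give an exact match.
print(found(r"^(ab|cd)$", "abxyz"))   # None
print(found(r"^(ab|cd)$", "cd"))      # cd
print(found(r"^ab$|^cd$", "ab"))      # ab
```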

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
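The scoping of local modifiers can be imitated in Python's <code>re</code> module, with one caveat that makes this only an approximate sketch: Python does not accept a bare "(?i)" in the middle of a pattern in recent versions, so the scoped form <code>(?i:...)</code> (Python 3.6 and later) is used to mark the same case-insensitive parts as in the two examples above.

```python
import re

# Equivalent of "abc(de(?i)gh)xyz": only "gh" is case-insensitive.
INNER = re.compile(r"abc(de(?i:gh))xyz")
# Equivalent of "abc(de)(?i)ghxyz": "ghxyz" is case-insensitive.
TAIL = re.compile(r"abc(de)(?i:ghxyz)")

print(INNER.fullmatch("abcdeGHxyz") is not None)  # True
print(INNER.fullmatch("abcdeghXYZ") is not None)  # False
print(TAIL.fullmatch("abcdeGHXYZ") is not None)   # True
print(TAIL.fullmatch("ABCdeghxyz") is not None)   # False
```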

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to the hinting without any need to understand regular expressions!
* The '''Regular expression''' which means Perl-compatible regex dialect is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so that you may freely format your regexes with spaces. It will also ignore line breaks with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
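To give a feeling for the extended notation, here is a sketch using Python's <code>re.VERBOSE</code> flag, which behaves much like the 'x' modifier (whitespace ignored, # starts a comment). Whether every detail coincides with the Preg extended notation is an assumption; the regex and the name <code>DECLARATION_X</code> are illustrative only.

```python
import re

DECLARATION_X = re.compile(r"""
    int \s+                    # keyword followed by at least one space
    (?P<var> [_a-zA-Z]\w* )    # variable name
    \s* ; \s*                  # end of the declaration
    (?P=var)                   # the same name again
    \s* = \s*
    -? \d+                     # integer literal, optionally negative
    \s* ;
""", re.VERBOSE)

print(DECLARATION_X.fullmatch("int x; x = 5;") is not None)  # True
print(DECLARATION_X.fullmatch("int x; y = 5;") is not None)  # False
```

Spaces that must actually be matched are written as \s, exactly as the notation description recommends.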

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and at which character the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set hint '''penalty''' more than usual question '''penalty''', because they are applied separately: usual penalty for an attempt without hinting, while hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly facilitates thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not a desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

Not all regular expressions have to give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
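
The selection rule can be sketched in a few lines of Python. The answer list, the grades and the "usable_for_hinting" function are hypothetical and only illustrate the comparison against the hint grade border; they are not the question type's own code.

```python
# Hypothetical answers: (regular expression, grade as a fraction of 1).
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
    (r"are colou?rful", 0.0),
]

def usable_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(usable_for_hinting(ANSWERS, 1.0))  # only the 100% answer
print(usable_for_hinting(ANSWERS, 0.5))  # the 100% and the 60% answers
```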

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching, the engine remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Specifically, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
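
The placeholder substitution described above can be imitated with a short Python sketch. The "fill_feedback" function is a hypothetical helper written for this illustration (it is not the question type's implementation), and Python's "re" module stands in for the real matching engines.

```python
import re

def fill_feedback(regex, feedback, response):
    """Replace {$n} placeholders in feedback with the captured subpatterns."""
    match = re.search(regex, response)
    if match is None:
        return None  # no match, so this answer's feedback is not shown

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return placeholder.group(0)   # unknown subpattern: leave as is
        return match.group(number) or ""  # unmatched subpattern -> empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

REGEX = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(REGEX, FEEDBACK, "123.34"))
print(fill_feedback(REGEX, FEEDBACK, ".5"))
```

The second call shows the empty-string case: "Integral part is" stays in the text with nothing after it.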

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
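
As a rough illustration of parser-level error reporting, the sketch below asks Python's "re" parser for its error message. This is only an analogy: Python's parser is neither PCRE nor the Preg parser, its messages and its set of detected errors differ from the list above, and "regex_error" is a hypothetical helper.

```python
import re

def regex_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None

# A few patterns from the error classes listed above, plus a valid one.
for pattern in ["abc)", "(?:qwerty", "+", "[z-a]", "x{5,3}", "abc"]:
    print(repr(pattern), "->", regex_error(pattern))
```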

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (which matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you are for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
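
The behaviour of the missing-word regex can be checked with a short Python sketch (Python's "re" module handles this particular expression the same way as PCRE; the constant and the "is_missing" helper are hypothetical names used only here):

```python
import re

# The regexp that should be missing, surrounded with "^(?!" and ")".
MISSING_NECESSARY = re.compile(r"^(?!" + r".*\bnecessary\b.*" + r")")

def is_missing(response):
    """True if the word "necessary" does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None

print(is_missing("It is necessary to go"))   # word present -> False
print(is_missing("It is needed"))            # word absent -> True
print(is_missing("An unnecessary step"))     # "\b" rejects the longer word -> True
```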

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" in the lookahead allows any number of spaces between the words; a PCRE lookbehind must match strings of a fixed length, so a single "\s" is used there. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

For now, the NFA and DFA matchers don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints with catching missing words, but you should make sure that the answers that look for missing things (and other answers that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't describe a correct situation, so this is not really a restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject your regexes): it allows almost anything the DFA engine can do, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of PHP regular expressions are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpattern numbers)
* backreference capturing (including named backreferences)

So you don't have to choose between hinting and subpattern capturing in your questions - NFA can do both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of a regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how a regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.jpg|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember a part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there to match against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

==The ways to give back==
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could refer to, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-07-26T15:51:21Z

Oasychev: /* Development plans */

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and a part about Preg question type itself. If you are familiar with regex syntax you may skip the first two parts and go to [[#Usage of the Preg question type|usage of the Preg question type]]. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support, explaining tree (authoring tool) - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time. Thanks to Joseph Rezeau for being devoted tester of Preg question type releases and being the original author of many ideas that have been implemented in Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), minding their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns, so their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set, it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
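
A few of these operands can be tried out with Python's "re" module, which treats the constructs used below the same way as PCRE (it does not support "\Q...\E" or "\p{xx}", so those are left out). The "matches_exactly" and "found_in" helpers are hypothetical names for this sketch.

```python
import re

def matches_exactly(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

def found_in(regex, text):
    """True if the regex matches somewhere inside the text."""
    return re.search(regex, text) is not None

print(matches_exactly(r"a\*b\[", "a*b["))        # escaped special characters
print(matches_exactly(r"[a-szC-F0-9]", "z"))     # single character in a class
print(matches_exactly(r"[a-szC-F0-9]", "t"))     # "t" is outside "a-s"
print(matches_exactly(r"[^a-z-]", "-"))          # hyphen excluded by the negative class
print(found_in(r"\bcat\b", "a cat sat"))         # word boundaries on both sides
print(found_in(r"\bcat\b", "concatenate"))       # no boundary inside a word
```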

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
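
The empty alternative and the counted quantifiers from the list above can be checked with a small Python sketch ("re" is used as a stand-in for the question's engines; "full" is a hypothetical helper):

```python
import re

def full(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Empty alternative: "aac", "bbc" or just "c".
print([text for text in ["aac", "bbc", "c", "abc"] if full(r"(aa|bb|)c", text)])

# Counted repetition: from 2 to 4 "x".
print([text for text in ["x", "xx", "xxxx", "xxxxx"] if full(r"x{2,4}", text)])

# Repeating more than one character needs parentheses.
print([text for text in ["", "abab", "aba"] if full(r"(ab)*", text)])
```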

====Precedence and order of evaluation====
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
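
The precedence rules are easy to verify by comparing regexes with and without parentheses. This is a sketch using Python's "re" module, which follows the same precedence as PCRE for these operators; the "full" helper is a hypothetical name.

```python
import re

def full(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
print(full(r"ab*", "abbb"), full(r"ab*", "abab"), full(r"(ab)*", "abab"))

# Quantifier over alternative: "*" repeats only "d" unless parentheses are used.
print(full(r"ab|cd*", "cddd"), full(r"ab|cd*", "abcd"), full(r"(ab|cd)*", "abcd"))

# A quantifier repeats the whole alternative, not one fixed choice from it.
print(full(r"(a|b){2}", "ab"), full(r"a{2}|b{2}", "ab"), full(r"a{2}|b{2}", "bb"))
```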

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a small example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
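
The backreference examples above can be run as they are in Python, whose "re" module uses the same "\1" syntax. The constant and function names in this sketch are hypothetical.

```python
import re

DOUBLED = re.compile(r"([ab])\1")
DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

def declares_and_initializes(code):
    """True if code declares an int variable and assigns to that same variable."""
    return DECLARATION.fullmatch(code) is not None

print([text for text in ["aa", "bb", "ab", "ba"] if DOUBLED.fullmatch(text)])
print(declares_and_initializes(" int var2 ; var2=123 ; "))
print(declares_and_initializes("int _var; _var = -10;"))
print(declares_and_initializes("int a; b = 1;"))  # different names -> no match
```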

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
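
As a small illustration, here is a named subpattern with a named backreference in Python. Note the assumption: Python's "re" module supports only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the other spellings cannot be shown this way.

```python
import re

# Named version of "([ab])\1": the name "letter" is an arbitrary choice.
DOUBLED = re.compile(r"(?P<letter>[ab])(?P=letter)")

match = DOUBLED.fullmatch("bb")
print(match.group("letter"))    # the captured text is available by name
print(match.group(1))           # ... and still by number
print(DOUBLED.fullmatch("ab"))  # the backreference must repeat the same letter
```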

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions check some part of the string without including it in the matched text, but they affect whether a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
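
The four assertions can be observed by looking at what exactly gets matched. This sketch uses Python's "re" module, which handles these four expressions the same way as PCRE; "first_match" is a hypothetical helper.

```python
import re

def first_match(regex, text):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(regex, text)
    return match.group(0) if match else None

print(first_match(r"a+(?=b)", "aaab"))     # "b" is checked but not included
print(first_match(r"a+(?!b)", "aaab"))     # gives back one "a" to avoid the "b"
print(first_match(r"(?<=b)a+", "caabaa"))  # only the "a"s right after "b"
print(first_match(r"(?<!b)a+", "baa"))     # skips the "a" right after "b"
```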

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will go through all regular expressions to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and take the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.
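
A rough sketch of the grading part of this process is shown below. It assumes that answers are tried in the order they are listed and ignores partial matching; the answers, grades and the "grade" function are hypothetical, and Python's "re.IGNORECASE" flag plays the role of the "i" modifier.

```python
import re

# Hypothetical answers, tried in order: (regular expression, grade fraction).
ANSWERS = [
    (r"^they are blue, white(,| and) red$", 1.0),
    (r"are white(,| and) red", 0.6),
]

def grade(response, case_sensitive=True):
    """Return the grade of the first answer that has a full match."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for regex, fraction in ANSWERS:
        if re.search(regex, response, flags):
            return fraction
    return 0.0  # no full match: zero grade

print(grade("they are blue, white and red"))
print(grade("I think they are white, red"))
print(grade("They Are Blue, White, Red"))
print(grade("They Are Blue, White, Red", case_sensitive=False))
```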

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
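
The anchoring pitfall can be checked with a small sketch in Python (used here only for illustration). The helper anchor() is hypothetical: it shows one way to add ^ and $ without changing subpattern numbers, and is not necessarily how the question does it internally.

```python
import re

def found(pattern, response):
    """True if the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None

# "^ab|cd$" means "(^ab)|(cd$)": each anchor belongs to one alternative only.
loose = [found(r"^ab|cd$", s) for s in ("abxyz", "xyzcd", "ab", "xabx")]

# Parentheses make both anchors apply to the whole alternative set.
exact = [found(r"^(ab|cd)$", s) for s in ("abxyz", "xyzcd", "ab", "cd")]

def anchor(pattern):
    # Hypothetical helper: a non-capturing group keeps subpattern numbers unchanged.
    return "^(?:" + pattern + ")$"

wrapped = [found(anchor(r"ab|cd"), s) for s in ("abxyz", "ab", "cd")]

print(loose, exact, wrapped)
```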

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen on the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
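
The scope of the two examples can be checked with a rough sketch. It assumes Python's re module, which treats a bare "(?i)" as a global flag that must stand at the start of the pattern, so the sketch marks the same regions with the scoped form "(?i:...)" instead. The test strings are made up.

```python
import re

# Same region as in "abc(de(?i)gh)xyz": only "gh" ignores case.
inner = re.compile(r"abc(de(?i:gh))xyz")

# Same region as in "abc(de)(?i)ghxyz": everything after the group ignores case.
tail = re.compile(r"abc(de)(?i:ghxyz)")

for text in ("abcdeGHxyz", "abcdeghXYZ", "ABCdeghxyz"):
    print(text, inner.fullmatch(text) is not None, tail.fullmatch(text) is not None)
```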

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers will be written. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to hinting without any need to understand regular expressions!
* The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so you may freely format your regexes with spaces. It will also ignore line breaks with one useful exception: everything from a # character (unescaped and not part of a character class) to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
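
To get a feeling for the extended notation, here is a sketch that uses Python's re.VERBOSE flag, which behaves much like the 'x' modifier. This is an assumption made for illustration; details of the Preg notation may differ. The regex is the decimal number example used elsewhere on this page.

```python
import re

# Whitespace and "#" comments are ignored, so the regex can be laid out over several lines.
decimal = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
    """, re.VERBOSE)

# The same regex written in the default (compact) notation.
compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

print(decimal.fullmatch("-12.5").groups(), compact.fullmatch("-12.5").groups())
```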

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts matching and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have the '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexems. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

The '''next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

The Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).
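
A very rough idea of what a "simple english" lexer does, and how a next lexem hint follows from lexem borders, is sketched below in Python. It is not the formal languages block: the LEXEM pattern, both function names and the assumption that one complete correct string is already known are all made up for illustration.

```python
import re

# A word or number is a run of word characters; a punctuation lexem is a run of
# characters that are neither word characters nor whitespace. Spaces only separate lexems.
LEXEM = re.compile(r"\w+|[^\w\s]+")

def lexems(text):
    """Split text into lexems, dropping the separating spaces."""
    return LEXEM.findall(text)

def next_lexem_hint(correct, matched_len):
    """Extend the first matched_len characters of `correct` to the end of the
    current lexem (if the prefix ends inside one) or of the next lexem."""
    for m in LEXEM.finditer(correct):
        if m.end() > matched_len:
            return correct[:m.end()]
    return correct

correct = "are blue, white and red"
print(lexems("are blue, white?!"))
print(next_lexem_hint(correct, len("are bl")))
print(next_lexem_hint(correct, len("are blue")))
print(next_lexem_hint(correct, len("are blue,")))
```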

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
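
The choice between the two hint characters can be reproduced with a toy sketch in Python. The real engines work on automata built from the regex; here the two strings the expression can match are simply written out by hand, and the function name next_char_hint is hypothetical.

```python
def next_char_hint(completions, correct_prefix):
    """Pick the next character from the completion that needs the fewest extra characters."""
    candidates = [c for c in completions
                  if c.startswith(correct_prefix) and len(c) > len(correct_prefix)]
    if not candidates:
        return None
    shortest = min(candidates, key=len)
    return shortest[len(correct_prefix)]

# The two strings that "are blue, white(,| and) red" can match, written out by hand.
completions = ["are blue, white, red", "are blue, white and red"]

# The response "are blue, white; red" is correct up to "are blue, white".
hint = next_char_hint(completions, "are blue, white")
print(repr(hint))
```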

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
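
The effect of the hint grade border is a simple filter, sketched here in Python. The function name hinting_answers and the data layout are assumptions made for illustration, not the question type's code.

```python
def hinting_answers(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [pattern for pattern, grade in answers if grade >= hint_grade_border]

# The two answers from the text above, with grades as fractions (1.0 means 100%).
answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

print(hinting_answers(answers, 1.0))  # only the 100% answer is used for hinting
print(hinting_answers(answers, 0.5))  # both answers are used for hinting
```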

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have a different number of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to remove "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
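
The placeholder substitution described above can be imitated with a few lines of Python. This is a sketch of the idea only; the function name fill_feedback is hypothetical and the question type's own code may work differently.

```python
import re

def fill_feedback(pattern, response, feedback):
    """Replace {$0}, {$1}, ... in feedback with the captured parts of the match."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # no such subpattern: leave the text as it is
        value = match.group(index)
        return value if value is not None else ""  # unmatched subpattern: empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

pattern = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(pattern, "123.34", feedback))
print(fill_feedback(pattern, ".5", feedback))
```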

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

The '''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.
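
To see what it looks like when a parser reports such errors instead of silently accepting the pattern, you can try a few of the examples above with Python's re module. Python is a different engine from both PCRE and the Preg parser, so this is only an illustration: its messages and the exact set of reported errors differ. The function name check_regex is hypothetical.

```python
import re

def check_regex(pattern):
    """Return None if the pattern compiles, otherwise a short error description."""
    try:
        re.compile(pattern)
    except re.error as err:
        return "{} (at position {})".format(err.msg, err.pos)
    return None

# A few of the erroneous patterns from the list above.
samples = ["abc)", "(?:qwerty", "+", "[a-fA-F\\d", "[z-a]", "a{5,3}", "(abc)\\2"]
report = {pattern: check_regex(pattern) for pattern in samples}

for pattern, message in report.items():
    print(pattern, "->", message)
```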

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
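
You can check the missing word expression, and see why the "^" matters, with a short sketch. Python's re module is used only for illustration; the function name lacks_word and the sample responses are made up.

```python
import re

# The expression from the text: matches (with an empty match at the start)
# only when the word "necessary" does not occur anywhere in the response.
missing_necessary = re.compile(r"^(?!.*\bnecessary\b.*)")

def lacks_word(response):
    return missing_necessary.search(response) is not None

# Without "^" the assertion can succeed at a later position, after the start
# of the word itself, so the regex matches even though the word is present.
unanchored = re.compile(r"(?!.*\bnecessary\b.*)")

print(lacks_word("water is optional"))
print(lacks_word("water is necessary for life"))
print(unanchored.search("water is necessary for life") is not None)
```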

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part) nor "victor" after it (the "(?!\s+victor)" part). In the lookahead, "\s+" allows any number of spaces between words; the lookbehind uses a single "\s", because PCRE requires lookbehind assertions to have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

The NFA and DFA matchers for now don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so it is no real restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions it should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject your regexes); it allows almost anything the DFA engine can do, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of PHP regular expressions are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns and duplicate subpattern numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of the regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
Here you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.jpg|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it is performing as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

==The ways to give back==
I am a high school teacher, researcher and programmer who has much to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

==Development plans==
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2013-07-26T15:50:50Z

Oasychev: /* The ways to give back */

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and a part about the Preg question type itself. If you are familiar with regex syntax you may skip the first two parts and go to [[#Usage of the Preg question type|usage of the Preg question type]]. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support, explaining tree (authoring tool) - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?
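
If you want to see the order of evaluation in action, here is a tiny sketch in Python with arbitrarily chosen values for x and y:

```python
x, y = 3, 4  # arbitrary example values

# "*" has precedence over "+", so y * 2 is evaluated first: 3 + 8.
default_order = x + y * 2

# Parentheses force "+" to be evaluated first: 7 * 2.
forced_order = (x + y) * 2

print(default_order, forced_order)
```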

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands (have those meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
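
Several of the operands above can be tried out with the sketch below. It assumes Python's re module, used only for illustration; the "\Q...\E" sequence, unicode properties and POSIX classes are left out because that module does not support them. The test strings are made up.

```python
import re

def matches(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "escaped special characters": matches(r"a\*b\[", "a*b["),
    "dot matches any character": matches(r"a.c", "a-c"),
    "ranges plus a single z": all(matches(r"[a-szC-F0-9]", ch) for ch in "aszCF09"),
    "t is outside a-s": not matches(r"[a-szC-F0-9]", "t"),
    "negative set": matches(r"[^a-z-]", "Q") and not matches(r"[^a-z-]", "-"),
    "escaping inside a set": all(matches(r"[\-\]\\]", ch) for ch in "-]\\"),
    "word character and digit": matches(r"\w\d\s", "_7 "),
    "word boundary": re.search(r"\bam\b", "I am here") is not None
                     and re.search(r"\bam\b", "hamster") is None,
}

for name, ok in checks.items():
    print(name, ok)
```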

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
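A few of the operators listed above can be tried out in a short sketch. Python's re module is used here only as a convenient stand-in for the Preg engines, and the helper name matches_whole is hypothetical:

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Alternative with an empty branch: "(aa|bb|)c" accepts "aac", "bbc" and "c".
empty_branch = [s for s in ("aac", "bbc", "c", "abc") if matches_whole(r"(aa|bb|)c", s)]

# The quantifier "x{2,4}" accepts two to four repetitions of "x".
counts = [n for n in range(0, 7) if matches_whole(r"x{2,4}", "x" * n)]

# "(a|b){2}" repeats the alternative, so each repetition may pick a different branch.
pairs = [s for s in ("aa", "ab", "ba", "bb", "a", "abc") if matches_whole(r"(a|b){2}", s)]
```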

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''', and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are applied first and repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - adding parentheses to the previous regex lets us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
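The precedence rules above can be compared side by side. This is a minimal sketch using Python's re module as a stand-in for the Preg engines; the candidate strings and the helper accepted are made up for illustration:

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches completely."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "abbb", "abab", "ab", "cddd", "abcd", "aa", "ba"]

# Quantifier over concatenation: "*" applies to "b" only unless parentheses are used.
without_parens = accepted(r"ab*", candidates)
with_parens = accepted(r"(ab)*", candidates)

# Quantifier over alternative: "*" applies to "d" only unless parentheses are used.
alt_then_star = accepted(r"ab|cd*", candidates)
star_over_alt = accepted(r"(ab|cd)*", candidates)

# Repeating an alternative versus choosing between two repetitions.
repeated_choice = accepted(r"(a|b){2}", candidates)
fixed_choice = accepted(r"a{2}|b{2}", candidates)
```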

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by their opening parentheses, with 0 standing for the whole regex, so that "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", the subpatterns are "bcd" (1st), "c" (2nd) and "d" (3rd).
Subpatterns are usually used with '''backreferences''', which have numbers too. Backreferences are '''operands''' that match the same string that was matched by the subpattern with the same number. The simplest syntax for a backreference is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference must match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc., so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it later.
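The backreference examples above can be run as a quick check. The sketch uses Python's re module as a stand-in for the Preg engines, which is an assumption made for illustration only:

```python
import re

# "\1" must repeat exactly what the 1st subpattern captured.
same_letter = [s for s in ("aa", "bb", "ab", "ba") if re.fullmatch(r"([ab])\1", s)]

# The C declaration regex from the text.
declaration = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

same_name = declaration.fullmatch(" int var2 ; var2=123 ; ")
other_name = declaration.fullmatch("int var2; var3 = 123;")  # a different name is assigned

captured_name = same_name.group(1)  # the text remembered by the 1st subpattern
```

The second response is rejected because "var3" is not the string that the subpattern captured.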

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
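As a rough illustration of why names survive edits, here is a sketch in Python's re module. Note the assumption: Python accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the other spellings are not shown. The doubled-word regex is a made-up example:

```python
import re

# "(?P<word>...)" names the subpattern; "(?P=word)" refers back to it by name.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

hit = doubled.search("this is is a test")
miss = doubled.search("this is a test")

# Adding another subpattern in front changes the number of "word", but not its name.
shifted = re.compile(r"(\d+\s+)?(?P<word>\w+)\s+(?P=word)\b")
number_before = doubled.groupindex["word"]
number_after = shifted.groupindex["word"]
```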

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group uses the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" both have number 1, "b" and "d" both have number 2.

====Assertions====
Assertions about some part of the string are not included in the matched text, but they affect whether a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
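The four assertions can be observed on tiny inputs. This sketch uses Python's re module as a stand-in for the Preg engines (an assumption for illustration); the input strings are made up:

```python
import re

# Positive lookahead: "b" must follow, but it is not part of the match.
ahead = re.search(r"a+(?=b)", "aaab").group(0)

# Negative lookahead: the engine gives back one "a" so that the next character is not "b".
not_ahead = re.search(r"a+(?!b)", "aaab").group(0)

# Positive lookbehind: "b" must precede, but it is not part of the match.
behind = re.search(r"(?<=b)a+", "baa").group(0)

# Negative lookbehind: the match cannot start right after "b".
not_behind = re.search(r"(?<!b)a+", "baa").span()
```

The negative lookahead case is worth a second look: on "aaab" the match is shorter than the full run of "a", because the last "a" is the one that is followed by "b".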

====Matching====
Matching means finding a part of the student's answer that fits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (the question adds the modifiers for you: "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (one that matches at least one regex with a 100% grade) to be shown to the student as the '''correct answer'''. The question tries the regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and gives the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one is the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with '''^''', the match should start at the start of the student's response;
* if a regular expression ends with '''$''', the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get an exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression. For example, with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
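The scope of local modifiers can be approximated in Python, with one important caveat: Python's re module does not accept "(?i)" in the middle of a pattern, so the sketch below uses Python's scoped form "(?i:...)" to express the same ranges. This is an analogy in a different dialect, not the Preg syntax itself.

```python
import re

# Python spelling of "abc(de(?i)gh)xyz": only "gh" ignores case.
inner = re.compile(r"abc(de(?i:gh))xyz")

# Python spelling of "abc(de)(?i)ghxyz": "ghxyz" ignores case.
tail = re.compile(r"abc(de)(?i:ghxyz)")

# The opposite direction: the whole regex ignores case except the "(?-i:...)" part.
mostly_insensitive = re.compile(r"abc(?-i:DEF)", re.IGNORECASE)

inner_ok = inner.fullmatch("abcdeGHxyz") is not None
inner_bad = inner.fullmatch("abcDEghxyz") is not None
tail_ok = tail.fullmatch("abcdeGHXYZ") is not None
tail_bad = tail.fullmatch("ABCdeghxyz") is not None
```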

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which the regexes for answers are written. The exciting part of notations is that you can use the Preg question type simply as an improved shortanswer, with access to hinting and without any need to understand regular expressions!
* The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. Line breaks are ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not part of a character class (use \s instead), so that you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from an unescaped # character (not part of a character class) to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
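To get a feel for the extended notation, the sketch below uses the re.VERBOSE flag of Python's re module, which behaves much like the 'x' modifier described above. Treating it as equivalent to the Preg notation is an assumption; the decimal number regex is borrowed from the feedback example on this page.

```python
import re

# Whitespace and "#" comments are ignored, so the regex can be laid out freely.
decimal = re.compile(r"""
    [+\-]?        # optional sign
    ([0-9]+)?     # optional integral part
    \.            # decimal point
    ([0-9]+)      # fractional part
""", re.VERBOSE)

# The same regex written in the compact notation.
compact = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

parts = decimal.fullmatch("-12.5").groups()

# A plain space is ignored; use \s when a space must really be matched.
space_ignored = re.compile(r"a b", re.VERBOSE).fullmatch("ab") is not None
space_matched = re.compile(r"a \s b", re.VERBOSE).fullmatch("a b") is not None
```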

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered the expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When only partial matching is used, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
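The idea of a partial match can be illustrated with a toy sketch. It is not how the Preg engines work: instead of running an automaton, it assumes the regex has been expanded by hand into the two literal strings it accepts, and it simply keeps the longest piece of the response that begins one of them. All names below are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def longest_partial_match(variants, response):
    """Toy partial matcher: try every start position and every variant,
    and keep the longest piece of the response that begins a variant."""
    best = ""
    for start in range(len(response)):
        for variant in variants:
            n = common_prefix_len(variant, response[start:])
            if n > len(best):
                best = response[start:start + n]
    return best

# Hand-expanded strings accepted by "are blue, white(,| and) red".
variants = ["are blue, white, red", "are blue, white and red"]
partial = longest_partial_match(variants, "they are blue, vhite and red")
```

The result includes the space after the comma, since the response still agrees with the expression at that point; the match breaks on "v".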

=====Next character hinting=====
When next character hinting is available, the student will see a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

The Preg question type, since the 2.3 release, supports next lexem hinting using the ''formal languages block''. You should choose the language in which you expect the response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
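The choice between the two hint characters can be sketched with a toy model. As an assumption made purely for illustration, the regex is expanded by hand into the two strings it accepts and the response is taken to start where the match starts; the real engines work on automata, not on string lists. All names are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def next_char_hint(variants, response):
    """Return (hint character, remaining completion) for the variant that
    agrees with the response longest; ties go to the shortest completion."""
    best = None
    for variant in variants:
        matched = common_prefix_len(variant, response)
        remaining = variant[matched:]
        if not remaining:
            continue  # this variant is already fully matched, nothing to hint
        key = (-matched, len(remaining))
        if best is None or key < best[0]:
            best = (key, remaining[0], remaining)
    if best is None:
        return None
    return best[1], best[2]

# Hand-expanded strings accepted by "are blue, white(,| and) red".
variants = ["are blue, white, red", "are blue, white and red"]
hint, completion = next_char_hint(variants, "are blue, white; red")
```

Both variants agree with the response up to "white", so the tie is broken by the length of what remains: 5 characters for the comma path against 8 for the " and" path.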

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
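The rule amounts to a simple filter over the answers. The sketch below is only a model of that rule; the data layout and the function name are hypothetical and do not reflect the question type's actual code.

```python
# The two answers from the example, with grades as fractions.
answers = [
    {"regex": r"are blue, white(,| and) red", "grade": 1.0},
    {"regex": r"are white(,| and) red", "grade": 0.6},
]

def hinting_answers(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the hint grade border."""
    return [a for a in answers if a["grade"] >= hint_grade_border]

only_full_grade = hinting_answers(answers, 1.0)
half_and_above = hinting_answers(answers, 0.5)
```

With the border at 1, the 60% answer still gives its feedback on a full match, but it never steers the hints.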

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last of all repetitions will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.
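These capturing rules can be checked in a few lines. The sketch uses Python's re module as a stand-in for the Preg engines, which is an assumption made for illustration:

```python
import re

# A repeated subpattern keeps only its last repetition.
repeated = re.fullmatch(r"(ab|cd)*", "ababcd")
last_repetition = repeated.group(1)

# "(?: )" groups without capturing, so no subpattern is created.
capturing_count = re.compile(r"(ab|cd)*x").groups
non_capturing_count = re.compile(r"(?:ab|cd)*x").groups

# Lookaround assertions don't create subpatterns either.
lookahead_count = re.compile(r"a(?=b)").groups
```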

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
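The placeholder substitution described above can be modelled in a few lines. This is a sketch in Python's re module, not the question type's own code; the function name fill_feedback and the decision to leave unknown placeholders untouched are assumptions.

```python
import re

def fill_feedback(pattern, response, feedback):
    """Replace {$n} placeholders with the text captured by subpattern n."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return placeholder.group(0)   # no such subpattern: leave the text as is
        return match.group(number) or ""  # unmatched subpattern becomes an empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

decimal = r"[+\-]?([0-9]+)?\.([0-9]+)"
template = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

full = fill_feedback(decimal, "123.34", template)
no_integral = fill_feedback(decimal, ".5", template)
```

In the second call the text "Integral part is" stays in place with nothing after it, which is exactly the limitation mentioned above.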

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.
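To see what reporting (rather than silently accepting) such mistakes looks like, here is a sketch with Python's re module, whose parser also rejects several of the constructs from the list above. It is a different parser from the one used by the NFA and DFA engines, so the messages and the exact set of reported errors will differ; the helper name is hypothetical.

```python
import re

def describe_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# A few faulty regexes taken from the list of error classes.
samples = [
    "abc)",         # unopened closing parenthesis
    "(?:qwerty",    # unclosed opening parenthesis
    "+",            # quantifier without an operand
    "[a-fA-F\\d",   # unclosed character class
    "[z-a]",        # incorrect character set range
    "a{5,3}",       # incorrect quantifier range
    "(abc)\\2",     # reference to a nonexistent subpattern
    "ab\\",         # backslash at the end of the pattern
]

report = {pattern: describe_error(pattern) for pattern in samples}
```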

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that works when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
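The effect of the anchor can be verified with a short sketch. Python's re module is used as a stand-in for the PHP preg extension engine (an assumption for illustration); the sample responses are made up.

```python
import re

missing_necessary = re.compile(r"^(?!.*\bnecessary\b.*)")

# The word is present, so the "missing word" answer does not match.
word_present = missing_necessary.search("it is necessary to go")

# The word is absent, so the answer matches (with an empty match at the start).
word_absent = missing_necessary.search("it is required to go")

# "unnecessary" does not contain "necessary" as a separate word.
only_inside_word = missing_necessary.search("it is unnecessary")

# Without "^" the assertion succeeds at a later position even though the word is present.
unanchored = re.search(r"(?!.*\bnecessary\b.*)", "it is necessary")
```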

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because PCRE requires lookbehind assertions to have a fixed length. If you want to catch the first (last) word (punctuation mark, etc.), then you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

The NFA and DFA matchers don't support the complex assertions used by these regexes for now. Since the Preg 2.3 release you can combine hints with catching missing words, but you should make sure that the answers that look for missing things (and other answers that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so this is not really a restriction.

====Matching engines====
A matching engine is the program code that performs the matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine has to handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. if your regexes don't get rejected); it allows almost anything the DFA engine does, but is much better tested and more stable.

This is custom PHP code that uses a DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones the behaviour can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns and duplicate subpattern numbers)
* backreferences (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of the regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of regular expression and it will be mapped on the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of expression you selected is shown by dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.jpg|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember a part of the match, you may speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there to match against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

==The ways to give back==
I am a high school teacher, researcher and programmer who has much to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could refer to would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I will be better able to determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Plik:qtype preg authortools71.png

2013-07-26T15:48:44Z

Oasychev: Oasychev uploaded a new version of "File:qtype preg authortools71.png"

Dyskusja pliku:qtype preg authortools7.jpg

2013-07-26T15:48:26Z

Oasychev: Created page with "this file is obsolete and could be deleted, only I don't seems to have rights to do it. Its replacement has png extension, so wiki don't allow to upload it as a new version. -..."

this file is obsolete and could be deleted, only I don't seems to have rights to do it.
Its replacement has png extension, so wiki don't allow to upload it as a new version.
--[[User:Oleg Sychev|Oleg Sychev]] ([[User talk:Oleg Sychev|talk]]) 23:48, 26 July 2013 (WST)

Preg question type

2013-07-26T15:44:32Z

Oasychev: /* Explaining graph */ - updated picture 7

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and a part about Preg question type itself. If you are familiar with regex syntax you may skip the first two parts and go to [[#Usage of the Preg question type|usage of the Preg question type]]. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support, explaining tree (authoring tool) - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), minding their evaluation order through '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns, so their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands, i.e. preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
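
The operands above can be tried out with a short sketch. It uses Python's built-in re module as a stand-in for the PCRE-based engines, which is an assumption of this example: Python has no "\Q ... \E", so re.escape is used to get a similar literal effect, and the sample strings are made up for illustration.

```python
import re

# Escaped special characters match themselves.
assert re.fullmatch(r"a\*b\[", "a*b[")
assert re.fullmatch(r"\\", "\\")

# re.escape gives an effect similar to \Q...\E (not available in Python's re):
# the escaped text is matched literally, the trailing dot stays a meta-character.
literal = re.escape("^(abc)$")
assert re.fullmatch(literal + ".", "^(abc)$!")

# Character classes: ranges mixed with the single character "z", and a negated class.
in_class = [c for c in "asztCG5" if re.fullmatch(r"[a-szC-F0-9]", c)]
not_lower_or_hyphen = [c for c in "a-Z9" if re.fullmatch(r"[^a-z-]", c)]

# Simple assertions consume no characters: \b keeps only whole words.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(in_class, not_lower_or_hyphen, whole_words)
```

Note that "t" and "G" are rejected by the first class because they fall outside the ranges "a-s" and "C-F".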

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of something used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
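
A few of these operators, checked with a minimal sketch. Python's re module is assumed here as a convenient stand-in for the engines described in this documentation, and the helper name matches is made up for the example.

```python
import re

def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Concatenation: operands written in sequence.
concat = [s for s in ["a5", "a", "ab"] if matches(r"a[0-9]", s)]

# Alternative with an empty branch ("emptiness").
alt = [s for s in ["aac", "bbc", "c", "abc"] if matches(r"(aa|bb|)c", s)]

# Quantifier with a range: from 2 to 4 repetitions.
rep = [s for s in ["x", "xx", "xxxx", "xxxxx"] if matches(r"x{2,4}", s)]

print(concat, alt, rep)
```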

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
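
The precedence rules can be verified directly. This is a small sketch using Python's re module (an assumption of the example; the precedence of these three operators is the same there), with a made-up helper named full.

```python
import re

def full(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifier binds tighter than concatenation: only "b" is repeated.
quantifier_first = full(r"ab*", "abbb") and not full(r"ab*", "abab")
grouped_repeat = full(r"(ab)*", "abab")

# Concatenation binds tighter than alternative: "ab" OR "c" followed by d's.
concat_first = full(r"ab|cd*", "cddd") and not full(r"ab|cd*", "abab")
grouped_alt = full(r"(ab|cd)*", "ababcdabcdcd")

# Repeating an alternative is not the same as alternating repetitions.
mixed_allowed = full(r"(a|b){2}", "ab")
mixed_rejected = not full(r"a{2}|b{2}", "ab")

print(quantifier_first, grouped_repeat, concat_first, grouped_alt,
      mixed_allowed, mixed_rejected)
```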

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by their opening parentheses, with 0 standing for the whole regex. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
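
The backreference examples above can be run as written. The sketch below assumes Python's re module, where numbered subpatterns and "\1" behave the same way for these patterns; the second test string is invented to show a failing case.

```python
import re

# The backreference must repeat exactly what the subpattern captured.
same = re.fullmatch(r"([ab])\1", "aa")
different = re.fullmatch(r"([ab])\1", "ab")

# The C declaration regex from the text, with optional spaces everywhere.
decl = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")
ok = decl.fullmatch(" int var2 ; var2=123 ; ")
bad = decl.fullmatch("int var2; other = 1;")  # a different name is assigned

print(ok.group(1))  # text captured by the 1st subpattern (the variable name)
```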

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
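
A minimal sketch of named subpatterns, assuming Python's re module. It uses the "(?P<name>...)" and "(?P=name)" spellings from the list above, which Python is known to accept; the pattern itself (a quoted string) is invented for the example.

```python
import re

# (?P<quote>...) names the subpattern, (?P=quote) refers back to it by name.
quoted = re.compile(r"(?P<quote>['\"])(?P<body>[^'\"]*)(?P=quote)")

m = quoted.fullmatch("'hello'")
mixed = quoted.fullmatch("'hello\"")  # opening and closing quotes differ

print(m.group("body"), m.group(1) == m.group("quote"))
```

Named subpatterns still get numbers, so "quote" here is also subpattern 1.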

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" both get number 1, "b" and "d" both get number 2.

====Assertions====
Assertions test some part of the string: the text they look at is not included in the match, but they determine whether a match can occur there:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
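
The four assertions can be compared on small strings. This sketch assumes Python's re module (which shares this syntax); the helper name first and the sample strings are made up.

```python
import re

def first(pattern, text):
    """Return the first match found in text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

ahead_pos = first(r"a+(?=b)", "caaab")    # "b" is checked but not included
ahead_neg = first(r"a+(?!b)", "aaab")     # gives back one "a" so that "b" does not follow
behind_pos = first(r"(?<=b)a+", "cabaa")  # skips the "a" that follows "c"
behind_neg = first(r"(?<!b)a+", "baa")    # skips the "a" that follows "b"

print(ahead_pos, ahead_neg, behind_pos, behind_neg)
```

The second case shows that an assertion can shorten a match: the greedy "a+" first takes "aaa", the assertion fails because "b" follows, and the engine backtracks to "aa".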

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and give the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.
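
The grading rule can be pictured with a rough sketch. It is not the question type's actual code: it assumes answers are tried in the order they were entered, uses Python's re module instead of the real engines, and the answer list and the function name grade are hypothetical.

```python
import re

# Hypothetical answers: (regex, grade) pairs in the order the teacher entered them.
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def grade(response, answers=ANSWERS, case_sensitive=False):
    """Return the grade of the first regex that has a match in the response."""
    # Stand-in for the "i" modifier added in case-insensitive mode.
    flags = 0 if case_sensitive else re.IGNORECASE
    for pattern, fraction in answers:
        if re.search(pattern, response, flags):
            return fraction
    return 0.0

print(grade("They ARE blue, white and red"), grade("they are white, red"))
```

Because the regexes here are unanchored, the leading "they" does not prevent a match.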

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
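
The anchoring pitfalls are easy to reproduce. The sketch below assumes Python's re module; the helper exact is a hypothetical illustration of what the exact matching option could do (wrapping the regex in a non-capturing group so that subpattern numbers stay the same), not the question's real implementation.

```python
import re

def found(pattern, text):
    """True when the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

def exact(pattern):
    """Assumed equivalent of exact matching: anchor the whole regex at both ends."""
    return "^(?:" + pattern + ")$"

samples = ["ab", "cd", "abxx", "xxcd"]
loose = [s for s in samples if found(r"^ab|cd$", s)]     # each anchor binds to one branch
strict = [s for s in samples if found(r"^(ab|cd)$", s)]  # anchors apply to both branches
wrapped = [s for s in samples if found(exact(r"ab|cd"), s)]

print(loose, strict, wrapped)
```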

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
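
A sketch of a locally case-insensitive part, with one loud assumption: Python's re module does not accept a bare "(?i)" in the middle of a pattern in recent versions, so the example uses the scoped form "(?i:...)" to get the same effect as the first bullet above.

```python
import re

# Case-sensitive by default (as with "use case" on the question level);
# only the "gh" part is made case-insensitive.
pattern = re.compile(r"abc(de(?i:gh))xyz")

upper_inside = pattern.fullmatch("abcdeGHxyz")   # case differs only in the insensitive part
upper_outside = pattern.fullmatch("ABCdeghxyz")  # case differs in the sensitive part

print(upper_inside is not None, upper_outside is not None)
```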

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to the hinting without any need to understand regular expressions!
* The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so you may freely format your regexes with spaces. It will also ignore line breaks, with one useful exception: everything from an (unescaped and not part of a character class) # character to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.
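
Two of the notations can be imitated in a short sketch. It assumes Python's re module: re.VERBOSE plays the role of the extended notation, and the helper shortanswer_to_regex is a hypothetical, simplified conversion (it ignores escaped asterisks) that only illustrates how a '*' wildcard relates to a regex.

```python
import re

# Extended notation: whitespace is ignored and "#" starts a comment.
WEIGHT = re.compile(r"""
    [+\-]?       # optional sign
    ([0-9]+)?    # optional integral part
    \.           # decimal point
    ([0-9]+)     # fractional part
    \s+ kg       # whitespace must be written as \s, plain spaces are ignored
""", re.VERBOSE)

def shortanswer_to_regex(answer):
    """Treat '*' as 'any characters' and everything else literally (simplified)."""
    return ".*".join(re.escape(part) for part in answer.split("*"))

weight_ok = WEIGHT.fullmatch("12.5 kg") is not None
wildcard_ok = re.fullmatch(shortanswer_to_regex("are blue*red"),
                           "are blue, white and red") is not None
literal_plus = re.fullmatch(shortanswer_to_regex("1+1=2"), "11=2") is None

print(weight_ok, wildcard_ok, literal_plus)
```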

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match needn't start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the one next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set hint '''penalty''' more than usual question '''penalty''', because they are applied separately: usual penalty for an attempt without hinting, while hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexems. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to his response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
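
The choice of the hint character can be illustrated with a toy sketch. It is not how the matching engines work internally: it assumes the regex matches only a handful of strings, which are listed by hand in the hypothetical TARGETS list, and it picks the partial match that needs the fewest characters to complete.

```python
# Hypothetical: the two strings matched by "are blue, white(,| and) red", listed by hand.
TARGETS = ["are blue, white, red", "are blue, white and red"]

def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_char_hint(response, targets=TARGETS):
    """Return (matched_part, hint_char) for the partial match that is shortest to complete."""
    best = None
    for target in targets:
        for start in range(len(response)):  # the regex is unanchored
            n = common_prefix_len(response[start:], target)
            if n == 0:
                continue
            candidate = (len(target) - n, start, target, n)
            if best is None or candidate < best:
                best = candidate
    if best is None:
        return None
    remaining, start, target, n = best
    return response[start:start + n], target[n:n + 1]

print(next_char_hint("they are blue, vhite and red"))
print(next_char_hint("are blue, white; red"))
```

For the second response both alternatives share the same partial match, and the "," branch wins because it leaves 5 characters to complete instead of 8.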

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
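
The rule amounts to a simple filter. In this sketch the answer list, the third (0% grade) regex and the function name are hypothetical; grades are written as fractions of 1.

```python
# Hypothetical answers as (regex, grade) pairs; grades are fractions of 1.
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
    (r"\bgreen\b", 0.0),
]

def regexes_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(len(regexes_for_hinting(ANSWERS, 1.0)), len(regexes_for_hinting(ANSWERS, 0.5)))
```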

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax could become complex and prone to errors.
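
The placeholder substitution can be sketched as follows. This is an illustration in Python's re module, not the question type's own code; the function name fill_feedback is made up, and leaving unknown placeholders untouched is an assumption of the example.

```python
import re

def fill_feedback(feedback, match):
    """Replace {$N} placeholders with the text captured by subpattern N."""
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # assumption: unknown placeholders stay as they are
        return match.group(index) or ""  # a subpattern that did not take part gives ""
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

full_number = fill_feedback(FEEDBACK, NUMBER.search("123.34"))
no_integral = fill_feedback(FEEDBACK, NUMBER.search(".5"))
print(full_number)
print(no_integral)
```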

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers and this is out of the current scope of development. There are more pressing issues ahead.
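
For comparison, here is how another regex implementation reports some of the same mistakes. The sketch uses Python's re module, which is only an analogy: its parser and messages are unrelated to the Preg parser, and the list BROKEN and the function report are made up for the example.

```python
import re

# A few of the malformed patterns from the list above.
BROKEN = ["abc)", "(?:qwerty", "+", "[a-fA-F\\d", "[z-a]", "x{5,3}"]

def report(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as err:
        return str(err)
    return None

for pattern in BROKEN:
    print(repr(pattern), "->", report(pattern))
```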

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary", where matching is possible even if the word is present).

If the description is difficult for you, just surround the regex for the thing that should be missing with '''^(?!.*''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
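
The missing-word regex can be tried as written. The sketch assumes Python's re module, which handles this pattern the same way; the sample responses are invented.

```python
import re

# Matches (at the start of the response) only when the word "necessary" is absent.
MISSING = re.compile(r"^(?!.*\bnecessary\b.*)")

word_present = MISSING.search("it is necessary to rest")
word_absent = MISSING.search("it is time to rest")
only_inside = MISSING.search("it is unnecessary to rest")  # \b rejects "necessary" inside another word

print(word_present is None, word_absent is not None, only_inside is not None)
```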

You can also have a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because PCRE lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc) - then you should place simple assertions for start/end of string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by start of the string and not followed by "am".
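
A sketch of the misplaced-word check, assuming Python's re module. The lookbehinds are written here with the "(?<!" syntax and a fixed-width body (a single "\s"), since Python, like PCRE, requires lookbehind assertions of fixed length; the sample sentences are invented.

```python
import re

# "am" that has no "I" right before it and no "victor" right after it.
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is not at the start of the string and is not followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

correct = "I am victor"
shuffled = "am I victor"

print(MISPLACED_AM.search(correct) is None, MISPLACED_I.search(correct) is None)
print(MISPLACED_AM.search(shuffled) is not None, MISPLACED_I.search(shuffled) is not None)
```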

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

NFA and DFA matchers for now don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so it's no real restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject your regexes): it allows almost anything the DFA engine does, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts), and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of the regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of the regular expression and it will be mapped onto the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]

The part of the expression you selected is shown as the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters; its edges show the paths through the nodes from beginning to end.
[[Image:qtype preg authortools4.jpg|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When an expression is matched, it remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember part of the match, you can speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools71.png|graph for regex (?:(abc)|de)f ]]

TODO Green rectangle shows you selected part of expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in database when saving the question, if you save regex (they will be lost if you close window with "cancel" button).

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Plik:qtype preg authortools71.png

2013-07-26T15:42:45Z

Oasychev:

Preg question type

2013-07-26T15:28:38Z

Oasychev: /* Authoring tools */

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and a part about the Preg question type itself. If you are familiar with regex syntax you may skip the first two parts and go to [[#Usage of the Preg question type|usage of the Preg question type]]. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support, explaining tree (authoring tool) - Valeriy Streltsov.
# DFA regex matching engine - Dmitriy Kolesov.
# Explaining graph (authoring tool) - Vladimir Ivanov.
# Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
# Regex description (authoring tool) - Dmitriy Pahomov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we can find that there is a definite '''order of evaluation''', governed by the operators' '''precedence'''. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set out a sequence of '''operators''' with the correct number of '''operands''' for each (arity), controlling their evaluation order through their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns, so their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except those from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\", other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches where there is no word boundary, the opposite of "\b".
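
The operands above can be tried out in any regex engine. Below is a small sketch using Python's standard <code>re</code> module as a stand-in (an assumption for illustration: it is not the Preg engine, and it does not support \Q...\E, \p{xx} or POSIX classes, so only the operands common to both are shown). The helper name <code>full</code> is hypothetical.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "escaped specials": full(r"a\*b\[", "a*b["),
    "simple class": full(r"[ab,!]", "!"),
    "range plus single char": full(r"[a-szC-F0-9]", "z"),
    "outside the ranges": full(r"[a-szC-F0-9]", "t"),
    "negative class": full(r"[^a-z-]", "A"),
    "negative class rejects hyphen": full(r"[^a-z-]", "-"),
    "escaping inside class": full(r"[\-\]\\]", "]"),
    # Simple assertions consume nothing; they only test the position.
    "word boundary": re.search(r"\bcat\b", "a cat sat") is not None,
    "no boundary inside a word": re.search(r"\bcat\b", "concatenate") is not None,
}

for name, result in checks.items():
    print(f"{name}: {result}")
```

Note that "t" is rejected by "[a-szC-F0-9]" because it lies between "s" and "z", and that "cat" inside "concatenate" is not found because there is no word boundary around it.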

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression that matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
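
As a quick check of the operators listed above, here is a sketch in Python's standard <code>re</code> module (used only as an illustrative stand-in for the Preg engines; the helper <code>full</code> and the variable names are hypothetical).

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# How many "x" characters does "x{2,4}" accept?
accepted_counts = [n for n in range(7) if full(r"x{2,4}", "x" * n)]

# Empty alternative: "(aa|bb|)c" accepts a bare "c" as well.
empty_alternative = [s for s in ("aac", "bbc", "c", "abc") if full(r"(aa|bb|)c", s)]

# A quantified alternative repeats the choice, not the chosen string.
repeated_choice = [s for s in ("aa", "ab", "ba", "bb", "a") if full(r"(a|b){2}", s)]

print(accepted_counts)
print(empty_alternative)
print(repeated_choice)
```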

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
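
The precedence rules above can be verified mechanically. The following sketch uses Python's standard <code>re</code> module as an assumed stand-in for a Perl-compatible engine; <code>full</code>, <code>cases</code> and <code>results</code> are hypothetical names chosen for this illustration.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

cases = [
    ("ab*", "abbb"),               # quantifier binds to "b" only
    ("ab*", "abab"),               # so "ab" is not repeated
    ("(ab)*", "abab"),             # parentheses make "ab" the operand
    ("ab|cd*", "cddd"),            # "c" followed by repeated "d"
    ("ab|cd*", "abcd"),            # alternatives are not concatenated
    ("(ab|cd)*", "ababcdabcdcd"),  # the whole alternative is repeated
    ("a{2}|b{2}", "ab"),           # only "aa" or "bb"
    ("(a|b){2}", "ab"),            # any two of "a"/"b"
]

results = {case: full(*case) for case in cases}
for (pattern, text), matched in results.items():
    print(f"{pattern!r} vs {text!r}: {matched}")
```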

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: the declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
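
To see the backreference at work, here is a minimal sketch that runs the whitespace-tolerant regex from the example above through Python's standard <code>re</code> module (an assumption made for illustration; the question type itself uses its own engines). The names <code>DECLARATION</code> and <code>declares_and_initializes</code> are hypothetical.

```python
import re

# The regex from the example above: "\1" must repeat the captured variable name.
DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

def declares_and_initializes(code):
    """Return the variable name if the code declares and then initializes it."""
    match = DECLARATION.fullmatch(code)
    return match.group(1) if match else None

print(declares_and_initializes(" int var2 ; var2=123 ; "))
print(declares_and_initializes("int _var; _var = -10;"))
print(declares_and_initializes("int a; b = 1;"))
```

The last call finds no match because the name captured by the subpattern ("a") differs from the name that follows ("b").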

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
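
A short sketch of named subpatterns, assuming Python's standard <code>re</code> module as the test bed. Of the syntaxes listed above, Python accepts only "(?P<name>...)" and "(?P=name)", so those two are used here; the names <code>DOUBLED</code>, <code>doubled_letter</code> and the subpattern name "letter" are hypothetical.

```python
import re

# Named equivalent of "([ab])\1": the same letter must appear twice.
DOUBLED = re.compile(r"(?P<letter>[ab])(?P=letter)")

def doubled_letter(text):
    """Return the repeated letter, or None when the text does not match."""
    match = DOUBLED.fullmatch(text)
    return match.group("letter") if match else None

print(doubled_letter("aa"))
print(doubled_letter("bb"))
print(doubled_letter("ab"))
```

If another subpattern is later inserted before this one, the numbered form "\1" would have to be renumbered, while the named form keeps working unchanged.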

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
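
The four assertions above can be observed in a small sketch, assuming Python's standard <code>re</code> module as a stand-in engine; the helper <code>first_match</code> is a hypothetical name.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

lookahead = first_match(r"a+(?=b)", "caaab")       # "b" is checked, not consumed
neg_lookahead = first_match(r"a+(?!b)", "aaab")    # gives back one "a" to avoid "b"
lookbehind = first_match(r"(?<=b)a+", "aabaa")     # only the "a"s after "b"
neg_lookbehind = first_match(r"(?<!b)a+", "baaa")  # skips the "a" right after "b"

print(lookahead, neg_lookahead, lookbehind, neg_lookbehind)
```

The negative cases show a common surprise: the engine may shorten or shift the match so that the assertion holds, rather than failing outright.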

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will use all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
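
The anchoring pitfalls above are easy to check in code. This is a sketch assuming Python's standard <code>re</code> module in place of the question's engines; <code>found</code> and <code>anchoring</code> are hypothetical names.

```python
import re

def found(pattern, response):
    """Return True when the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None

anchoring = {
    ("ab|cd", "xxabxx"): found(r"ab|cd", "xxabxx"),        # unanchored: anywhere
    ("^ab|cd$", "abxyz"): found(r"^ab|cd$", "abxyz"),      # "^" applies to "ab" only
    ("^ab|cd$", "xyzcd"): found(r"^ab|cd$", "xyzcd"),      # "$" applies to "cd" only
    ("^(ab|cd)$", "abxyz"): found(r"^(ab|cd)$", "abxyz"),  # both anchors apply
    ("^(ab|cd)$", "cd"): found(r"^(ab|cd)$", "cd"),
    ("^ab$|^cd$", "abcd"): found(r"^ab$|^cd$", "abcd"),    # each alternative anchored
}

for (pattern, response), matched in anchoring.items():
    print(f"{pattern!r} in {response!r}: {matched}")
```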

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This affects general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The exciting part of notations is that you can use the Preg question type just as improved shortanswer, having access to the hinting without any need to understand regular expressions!
* The '''Regular expression''' which means Perl-compatible regex dialect is the default one. Line breaks will be ignored - you can use them freely to structure big regexes.
* The '''Regular expression (extended)''' notation is there for really complex regexes. It is similar to the PHP 'x' modifier. It will ignore any unescaped whitespace in your regexes that is not part of a character class (use \s instead) - so that you may freely format your regexes with spaces. It will also ignore line breaks with one useful exception: everything from a # character (unescaped and not part of a character class) to the end of that line is treated as a comment.
* Choose the '''Moodle shortanswer''' notation and you can just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you can get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts matching and at which character the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to repeatedly press '''hint''', which is usually not desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
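
The selection rule for the hint grade border can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the question type's actual code; the function <code>hinting_candidates</code> and the <code>answers</code> structure (pairs of regex and grade as a fraction) are assumptions.

```python
def hinting_candidates(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

# The two answers from the example above: 100% and 60% grades.
answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

only_full_grade = hinting_candidates(answers, 1.0)
half_and_above = hinting_candidates(answers, 0.5)

print(only_full_grade)
print(half_and_above)
```

With the border at 1 the 60% answer is never used to build a hint, so the student is not led towards the response that omits "blue".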

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
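
The placeholder substitution described above can be imitated in a short sketch. It assumes Python's standard <code>re</code> module instead of the question's own engines, and the function name <code>fill_feedback</code> is hypothetical; it only mirrors the documented behaviour (unmatched subpatterns become empty strings).

```python
import re

def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders in feedback with subpatterns captured from response."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # No such subpattern in this regex: leave the placeholder as written.
            return placeholder.group(0)
        # A subpattern that did not take part in the match gives an empty string.
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```

For the response ".5" the text "Integral part is" stays in place with nothing after it, which is exactly the limitation mentioned above.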

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition (?(0);
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches the string "{2,4}"). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you are for or against this decision, please write your position and reasons in the comments. It might be best to have two modes, but that literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.
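To get a feel for what strict error reporting looks like in practice, here is a minimal sketch in Python. It uses Python's standard re module, which is neither PCRE nor the Preg parser, so the exact set of rejected patterns and the wording of the messages will differ; the helper name syntax_error is made up for this illustration.

```python
import re

def syntax_error(pattern):
    """Return the parser's message if the pattern is rejected, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# A few of the error classes listed above, as seen by Python's parser
# (an assumption-free stand-in only for the idea, not for Preg itself).
samples = ["abc)", "(?:qwerty", "+", "a(+)", "[a-fA-F\\d", "[z-a]", "a{5,3}", "(abc)\\2"]
for pattern in samples:
    print(repr(pattern), "->", syntax_error(pattern))
```

Every sample is reported as an error with a message, instead of being silently reinterpreted as literal text.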

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature that lets you define an answer which matches when something is absent from the response (and gives appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of the given point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, it is specific to Joseph Rezeau's REGEXP question type!
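The missing-word pattern can be tried outside Moodle. The following is a small sketch using Python's re module (an assumption for illustration; Preg itself runs on PHP), with a made-up helper name word_is_missing.

```python
import re

# The pattern from the text: it matches only when the word "necessary" is absent.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True if the response does not contain the whole word 'necessary'."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("This step is necessary."))    # False
print(word_is_missing("This step is optional."))     # True
print(word_is_missing("This step is unnecessary."))  # True
```

The last line shows the effect of the \b word boundaries: "unnecessary" does not count as the word "necessary".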

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part, a negative lookbehind assertion) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between the words; the lookbehind uses a single "\s" because PCRE requires lookbehind assertions to have a fixed length. If you want to catch the first (last) word (punctuation mark, etc.), you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
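As a quick check of the misplaced-word idea, here is a sketch in Python's re module (used only for illustration; it is not the Preg engine). The patterns are written with the standard negative lookbehind syntax (?<!...) and a single \s inside the lookbehind, since Python, like PCRE, wants fixed-width lookbehind. The names MISPLACED_AM, MISPLACED_I and is_misplaced are hypothetical.

```python
import re

# "am" that is neither preceded by "I " nor followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by " am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def is_misplaced(pattern, response):
    """True if the pattern finds its word in a wrong position."""
    return pattern.search(response) is not None

print(is_misplaced(MISPLACED_AM, "I am victor"))  # False: "am" is in place
print(is_misplaced(MISPLACED_AM, "am I victor"))  # True
print(is_misplaced(MISPLACED_I, "I am victor"))   # False: "I" is in place
print(is_misplaced(MISPLACED_I, "am I victor"))   # True
```

The check is rough: a response such as "victor I am" is not flagged by the first pattern, because "am" still has "I" right before it.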

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

The NFA and DFA matchers don't support the complex assertions used by these regexes for now. Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and other answers meant to give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so this is not really a restriction.

====Matching engines====
A matching engine is the program code that performs matching; different engines use different code. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
WARNING: This engine has lacked support over the past year. Use the NFA engine instead if you can (i.e. if your regexes don't get rejected); it allows almost anything the DFA engine does, but is much better tested and more stable.

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of PHP regular expressions are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences
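To illustrate why an automaton-based matcher can offer hinting, here is a minimal sketch of a hand-built DFA for the regex ab+c, written in Python. It is not the Preg code: the transition table, the state numbers and the function name match_with_hint are all assumptions made for this illustration.

```python
# Hypothetical transition table of a DFA for "ab+c".
# State 0 is the start state; state 3 is the only accepting state.
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"b": 2, "c": 3},
    3: {},
}
ACCEPTING = {3}

def match_with_hint(response):
    """Return (is_full_match, matched_prefix, possible_next_characters)."""
    state = 0
    matched = 0
    for ch in response:
        next_state = TRANSITIONS[state].get(ch)
        if next_state is None:
            break  # first character the automaton cannot consume
        state = next_state
        matched += 1
    is_full = matched == len(response) and state in ACCEPTING
    hints = sorted(TRANSITIONS[state])
    return is_full, response[:matched], hints

print(match_with_hint("abbc"))  # (True, 'abbc', [])
print(match_with_hint("abx"))   # (False, 'ab', ['b', 'c'])
```

The matched prefix is what a partial match would report, and the outgoing transitions of the last state reached give the candidate next characters for a hint.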

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns and duplicate subpattern numbers)
* backreference capturing (including named backreferences)

So you don't have to choose between hinting and subpattern capturing in your questions - NFA can do both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.
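For readers unfamiliar with named subpatterns and backreferences, here is a short sketch in Python's re module (an illustration only, not the NFA engine). It uses the (?P<name>...) and (?P=name) spelling, which PCRE accepts as well; the pattern and the name REPEATED_WORD are made up for this example.

```python
import re

# (?P<word>...) names a subpattern; (?P=word) is a backreference that
# must match exactly the same text the subpattern captured.
REPEATED_WORD = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

match = REPEATED_WORD.search("this is is a test")
print(match.group(0))       # is is
print(match.group("word"))  # is
print(REPEATED_WORD.search("this is a test"))  # None
```

The captured text (here "is") is the kind of value that can be inserted into feedback for the student.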

==Authoring tools==

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

[[Image:qtype preg authortools1.png|authoring tools icon]]

There are four authoring tools available:
# '''syntax tree''' - shows you the inner structure of the regular expression;
# '''explaining graph''' - shows you graphically how your expression will work;
# '''description''' - formulates the meaning of your expression in English;
# '''testing tool''' - allows you to enter strings and see how they match your regexes.

===Regular expression area===
There you can enter (or edit) a regular expression and refresh all the tools from it.

TODO You could also select a part of the regular expression and it will be mapped onto the tree, graph and description.

===Syntax tree===
As was said above, a regular expression is in fact an expression - a tree of operators and operands. The syntax tree shows this inner structure of the expression graphically: what is inside what.

If you don't understand operators and their precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: compare the trees for ''ab+'' (a) and ''(ab)+'' (b) in the picture below.
[[Image:qtype preg authortools2.png|parenthesis in the structure of regex]]
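The difference between the two trees shows up directly in what the expressions match. A small sketch in Python's re module (for illustration only; the helper name matches_whole is hypothetical):

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# In ab+ the quantifier applies to "b" only.
print(matches_whole(r"ab+", "abbb"))    # True
print(matches_whole(r"ab+", "abab"))    # False
# In (ab)+ the parentheses make the quantifier apply to "ab" as a whole.
print(matches_whole(r"(ab)+", "abab"))  # True
print(matches_whole(r"(ab)+", "abbb"))  # False
```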

The part of the expression you selected is shown by the dotted part of the tree.

[[Image:qtype preg authortools3.jpg|leftmost node of the tree is selected]]

===Explaining graph===
The graph shows how the regular expression works. Its nodes are matched characters, and its edges show paths through the nodes from the beginning to the end.
[[Image:qtype preg authortools4.jpg|alternatives and concatenation]]

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

[[Image:qtype preg authortools5.jpg|graph for regex ^\dabc[!,0-9]$]]

Dotted rectangles show you the repeated parts of your expression.

[[Image:qtype preg authortools6.jpg|graph for regex \d*]]

Solid line rectangles show you subexpressions. When the expression is matched, it remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember a part of the match, you may speed up matching by using (?: ) instead of ( ) parentheses.

[[Image:qtype preg authortools7.jpg|graph for regex (abc)|(?:abc)]]
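What "remembering" means can be seen in a short sketch using Python's re module (an illustration, not the Preg engine): both kinds of parentheses group in the same way, but only ( ) stores the matched text.

```python
import re

capturing = re.fullmatch(r"(abc)", "abc")
non_capturing = re.fullmatch(r"(?:abc)", "abc")

# Both expressions match the same string...
print(capturing is not None, non_capturing is not None)  # True True
# ...but only the capturing group remembers what it matched.
print(capturing.groups())      # ('abc',)
print(non_capturing.groups())  # ()
```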

TODO A green rectangle shows you the selected part of the expression.

===Description===
The description tries to formulate a sentence describing how the expression is supposed to work.

===Testing tool===
You may enter a set of strings there for matching against your expression. You'll see coloured strings showing which part of each string matched the expression, so you can test whether it performs as you expected.

TODO The strings will be saved in the database when saving the question, if you save the regex (they will be lost if you close the window with the "cancel" button).

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexeme hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Plik:qtype preg authortools4.png

2013-07-26T15:26:33Z

Oasychev:

Plik:qtype preg authortools2.png

2013-07-26T15:24:54Z

Oasychev: