Note: You are currently viewing documentation for Moodle 2.0. Up-to-date documentation for the latest stable version is available here: The OU PMatch algorithm.

The OU PMatch algorithm

From MoodleDocs
Revision as of 19:07, 26 October 2010 by Tim Hunt (talk | contribs) (New page: This is an algorithm that the OU hopes to turn into a Moodle question type. The following description is probably not sufficient to completely and unambiguously define the algorithm, but ...)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

This is an algorithm that the OU hopes to turn into a Moodle question type.

The following description is probably not sufficient to completely and unambiguously define the algorithm, but it is the best I can do so far from the existing documentation and looking at the code.

You can see an implementation of this algorithm in Java at https://openmark.dev.java.net/source/browse/openmark/trunk/src/om/helper/PMatch.java?view=markup Some of that code is, however, pretty horrible. (Can I point out that I did not write it ;-).)

The following attempts to be a logical description of the rules. That is, it describes what the correct result of the matching rules should be, not the best way to determine that answer efficiently. I am pretty sure that it is impossible to implement this algorithm in anything less than exponential time. In any case, the aim is to provide a powerful language that is nice for question authors to use, rather than to make something that is easy to implement. Also, an implementation should focus on efficiently matching the kinds of expressions that authors typically use. Edge cases are not important.


Inputs

The algorithm takes three inputs. Typically take a particular string (that is, a student's response) and match it against a number of different pmatch expressions using a particular set of options.

Overall match options

ignorecase
(boolean)
sentencedividers
(list of char) defaults to just '.'
worddividers
(list of char) defaults to all the characters that match \s from PCRE (http://www.php.net/manual/en/regexp.reference.escape.php).
checkspellings
(boolean)
extradictionarywords
(list of words) defaults to the list (i.e. ie. e.g. eg. etc.) You will note from this list that dictionary words can include sentence or word dividers, and in this case, they characters do not function as divider characters.

The string to match

A string of Unicode characters.

A pmatch expression

See the next section for a description of the algorithm. A PMatch expressions is naturally represented internally as an expression tree.


Parsing the input string

1. If ignorecase is true, convert the string to lower case using a unicode-aware algorithm (textlib in Moodle).

2. Search the string for extradictionarywords, and mark any sentencedividers worddividers characters within those words, so they are no longer considered to be divider characters. Also mark any '.' characters that appear between two digits as non-dividers. They are assumed to be decimal points.

3. Trim any sentencedividers from the start or end of the string, and split the string on runs of consecutive sentencedividers. We end up with an array of strings.

4. Trim any worddividers characters from each of those strings, and split them on runs of consecutive worddividers.

Thus the input string has been converted to a list of lists of words. Each of those lists of words is a sentence.


Checking spellings

It is assumed we have access to some dictionary (Moodle provides aspell). Words can be checked against that dictionary, and the extradictionarywords from the options. Any misspelt words can be reported.


Expression syntax and meaning

A pmatch expression, when applied to a string, evaluates to a boolean. The expression can be made up from the following parts.

not

not( <pmatch_expression> )

is a pmatch expression that is true if <pmatch_expression> is false, and false if <pmatch_expression> is true.

all

all( <pmatch_expression1> , <pmatch_expression2> , ... )

is a pmatch expression that is true if all of <pmatch_expression1> , <pmatch_expression2> and so on are true. Accepts two or more sub-expressions.

any

any( <pmatch_expression1> , <pmatch_expression2> , ... )

is a pmatch expression that is true if any one of <pmatch_expression1> , <pmatch_expression2> and so on are true. Accepts two or more sub-expressions.

match_...

match_options( <word_sequence> )

is true if, using the specified options, the <word_sequence> matches the string.

Recognised options:

c
allow any number of extra characters in each word.
m2
allow two misspellings of the following types in words of 8 characters or more. One misspelling in shorter words.
m
allow one misspelling of the fillowing types.
mx
allow one extra character in the word in the string, compared to the one in the expression, providing the word in the expression contains 3 non-wildcard characters.
mf
allow one fewer character in the word in the string, compared to the one in the expression, providing the word in the expression contains 4 non-wildcard characters.
mr
allow one letter in the word in the string to differ from the equivalent letter in the word in the expressions, providing the word in the expression contains 4 non-wildcard characters.
o
allow the words in the string to be an any order.
w
allow the string to contain extra words in addition to the ones specified in the expression.
p0, p1, p2, p3 or p4
set the proximity for the _ joiner, see below. The default is 2.

Different options can be run together in any order. For example match_omfxp2w. In the case where there are no options, the _ must be omitted.

c cannot be used at the same time as any m option.

Matching word_sequences is the fundamental part of PMatch.


Matching word_sequences

word_sequences

A word_sequence comprises one or more <word_alternative>s separated by <joiner>s.

<word_alternative1> <joiner1> <word_alternative2> <joiner2> <word_alternative3>...

There are two possible joiners:

' '
(space) functions as a logical AND. If the 'o' match options is given the words in the string may be in any order, and still match, otherwise they must be in the order given in the expression.
'_'
requires the two (sets of) words to to be in the string, they must be in the order given (irrespective of the 'o' match option) and there must be no more than 'proximity' other words between them. In addition, both words must be within the same sentence.

The word sequence matches the string providing all the word_alternatives can be made to match portions of the string in a way that satisfies the constraints imposed by the joiners.

word_alternatives

A word_alternative comprises one or more word_patterns, separated by '|' characters.

<word_pattern1> '|' <word_pattern2> ...

A word alternative matches a word or words in the string. There may be more than one way for the word_alternative to match the string. All options must be considered.

word_patterns

A word pattern is either a

<word_with_wildcards>

or a

<bracket_expression>

A word pattern matches a word or words in the string as explained below.

bracket_expressions

'[' <simple_alternative1> ' ' <simple_alternative2> ... ']'

Two or more simple alternatives enclosed in square brackets. Each alternative is separated by spaces, meaning and, like in a word_sequence.

A bracket_expression matches a set of words in the string, providing each simple_alternative matches a different word. In addition, the words in the string must be in the same order as the words in the pattern, unless the '0' matching option is given.

simple_alternatives

A simple alternative is one or more word_with_wildcards separated by '|' characters.

<word_with_wildcards1> '|' <word_with_wildcards2> ...

A simple alternative matches a particular word in the string providing any one of the alternative word_with_wildcards does.

word_with_wildcards

This is a sequence of ordinary characters with * and ? wildcards. * stands for any string of zero or more characters. ? stands for any single character.

If the ignorecase option is true, then all the characters in the word_with_wildcards are converted to lower case when the expression is parsed (using a unicode-safe conversion function like Moodle's textlib).

A word_with_wildcards matches a word in the string providing there is some transformation of the word in the string according to the permitted m and c options that makes the word match the pattern.

There is no way to match any of the characters "[]| _" as literal characters within a word.

Of course, one word_with_wildcards may match any number of words in the string. All possible matches must be considered.


Suggested PHP API

This seems like a natural design to me, based on the typical usage of testing one string against several expressions.

Note that we will also need to validate the expressions that teacher type into the editing form when creating questions.

/**

* Options that control the overall way the matching is done.
*/

class pmatch_options {

   /** @var boolean */
   public $ignorecase;
   /** @var string of sentence divider characters. */
   public $sentencedividers = '.';
   /** @var string of word diveder characters. */
   public $worddividers = " \f\n\r\t";
   /**
    * @var array of words to recognise. These words may include sentence or
    * word divider characters.
    */
   public $extradictionarywords = array('e.g.', 'eg.', 'etc.', 'i.e.', 'ie.');

}

/**

* Represents a string that is ready for matching, and provides the method to
* match expressions against it.
*/

class pmatch_parsed_string {

   /**
    * Constructor.
    * @param string $string the string to match against.
    * @param pmatch_options $options the options to use.
    */
   public function __construct($string, pmatch_options $options);
   /**
    * Test this string with a given pmatch expression.
    * @param pmatch_expression $expression the expression to use.
    * @return boolean whether this string matches the expression.
    */
   public function matches(pmatch_expression $expression);
   /**
    * @return boolean returns true if any word is misspelt.
    */
   public function is_spelt_correctly();
   /**
    * @return array all the distinct misspelt words.
    */
   public function get_spelling_errors();
   /**
    * @return pmatch_options the options that were used to construct this object.
    */
   public function get_options();

}

/**

* Represents a pmatch_expression.
*/

class pmatch_expression {

   /**
    * 
    * @param string $string the string to match against.
    * @param pmatch_options $options the options to use.
    */
   public function __construct($string, pmatch_options $options);
   /**
    * @return boolean returns false if the string passed to the constructor
    * could not be parsed as a valid pmatch expression.
    */
   public function is_valid();
   /**
    * @return string description of the syntax error in the expression string
    * if is_valid returned false. Otherwise returns an empty string.
    */
   public function get_parse_error();
   /**
    * @return pmatch_options the options that were used to construct this object.
    */
   public function get_options();
   /**
    * @return string the expression, exactly as it was passed to the constructor.
    */
   public function get_original_expression_string();
   /**
    * @return string a nicely formatted version of the expression.
    */
   public function get_formatted_expression_string();

}


Simple usage example

$matchoptions = new pmatch_options(); $string = new pmatch_parsed_string('Oil is lighter', $matchoptions); if (!$string->is_spelt_correctly()) {

   // Do something.

} $expression = new pmatch_expression('

   all(
       any(
           match_mw(great*|high|higher|more|bigger|heavier|heavy dens*|[specific gravity]|sg than_water),
           match_mw(dens*|[specific gravity]|sg great*|high|higher|more|bigger|heavier|heavy than_water),
           match_mw(dens* water < dens* oil),
       ),
       not(match_w(than_oil))
   )

', $matchoptions); if (!$expression->is_valid()) {

   // Do something.

} $matches = $string->matches($expression);


A small selection of unit tests

These should serve to clarify the some of the rules above. They assume the suggested API above.

class pmatch_test {

   protected function match($string, $expression, $options = null) {
       if (is_null($options)) {
           $options = new pmatch_options();
       }
       $string = new pmatch_parsed_string($string, $options);
       $expression = new pmatch_expression($expression, $options);
       return $string->matches($expression);
   }
   public function test_pmatch() {
       // Tests from the original pmatch documentation.
       $this->assertTrue('tom dick harry', 'match(tom dick harry)'); // This is the exact match.
       $this->assertTrue('thomas', 'match_c(tom)'); // Extra characters are allowed anywhere within the word.
       $this->assertTrue('tom dick and harry', 'match_w(dick)'); // Extra words are allowed anywhere within the sentence.
       $this->assertTrue('harry dick tom', 'match_o(tom dick harry)'); // Any order of words is allowed.
       $this->assertTrue('rick', 'match_m(dick)'); // One character in the word can differ.
       $this->assertTrue('rick and harry and tom', 'match_mow(tom dick harry)');
       $this->assertTrue('dick and harry and thomas', 'match_cow(tom dick harry)');
       $this->assertTrue('arthur harry and sid', 'match_mow(tom|dick|harry)'); // Any of tom or dick or harry will be matched.
       $this->assertTrue('tomy harry and sid', 'match_mow(tom|dick harry|sid)'); // The pattern requires either tom or dick AND harry or sid.
       $this->assertTrue('tom was mesmerised by maud', 'match_mow([tom maud]|[sid jane])'); // The pattern requires either (tom and maud) or (sid and jane).
       $this->assertTrue('rick', 'match(?ick)'); // The first character can be anything.
       $this->assertTrue('harold', 'match(har*)'); // Any sequence of characters can follow 'har'.
       $this->assertTrue('tom married maud sid married jane', 'match_mow(tom_maud)'); // Only one word is between tom and maud.
       $this->assertFalse('maud married tom sid married jane', 'match_mow(tom_maud)'); // The proximity control also specifies word order and over-rides the 'o' matching option.
       $this->assertFalse('tom married maud sid married jane', 'match_mow(tom_jane)'); // Only two words are allowed between tom and jane.
       $this->assertTrue('tom married maud', 'match_mow(tom|thomas marr& maud)');
       $this->assertTrue('maud marries thomas', 'match_mow(tom|thomas marr& maud)');
       $this->assertTrue('tom is to marry maud', 'match_mow(tom|thomas marr& maud)');
       $this->assertTrue('tempratur', 'match_m2ow(temperature)'); // Two characters are missing.
       $this->assertTrue('temporatur', 'match_m2ow(temperature)'); // Two characters are incorrect; one has been replaced and one is missing.
       // Tests of the misspelling rules.
       $this->assertTrue('test', 'match(test)');
       $this->assertFalse('tes', 'match(test)');
       $this->assertFalse('testt', 'match(test)');
       $this->assertFalse('tent', 'match(test)');
       $this->assertFalse('tets', 'match(test)');
       $this->assertTrue('test', 'match_mf(test)');
       $this->assertTrue('tes', 'match_mf(test)');
       $this->assertFalse('testt', 'match_mf(test)');
       $this->assertFalse('tent', 'match_mf(test)');
       $this->assertFalse('tets', 'match_mf(test)');
       $this->assertFalse('te', 'match_mf(tes)');
       // and so on.
   }

}

Template:CategoryDeveloper