The OU PMatch algorithm

Jump to: navigation, search
Note: The Open University's Pattern Match question type is available from the Moodle plugins database.


This is an algorithm that the OU hopes to turn into a Moodle question type.

The code for the existing Java implementation can be seen at https://openmark.dev.java.net/source/browse/openmark/trunk/src/om/helper/PMatch.java?view=markup.

The following attempts to be a logical description of the algorithm. That is, it describes what the result of matching an expression against a string. It does not describe the best way to determine that result efficiently. Implementing these rules in full generality is very hard to do efficiently. It is more important that common expressions can be matched efficiently against typical strings, than that complex edge cases perform well.

A small selection of typical pmatch expressions can be seen at the end of the document, in the section on unit test cases.


Inputs

The algorithm takes three inputs. Typically it takes a particular string (that is, a student's response) and matches it against a number of different pmatch expressions using a particular set of options.

Overall match options

ignorecase 
(boolean)
sentencedividers 
(list of char) defaults to just '.'
worddividers 
(list of char) defaults to all the characters that match \s from PCRE (http://www.php.net/manual/en/regexp.reference.escape.php).
checkspellings 
(boolean)
extradictionarywords 
(list of words) defaults to the list (i.e. ie. e.g. eg. etc.) You will note from this list that dictionary words can include sentence or word dividers, and in this case, they characters do not function as divider characters.

The string to match

A string of Unicode characters.

A pmatch expression

See below.


Parsing the input string

1. If ignorecase is true, convert the string to lower case using a unicode-aware algorithm (textlib in Moodle).

2. Search the string for extradictionarywords, and mark any sentencedividers and worddividers characters within those words, so they are no longer considered to be divider characters. Also mark any '.' characters that appear between two digits as non-dividers. They are assumed to be decimal points.

3. The whole response is treated as a single string. Any sentencedividers are only considered in proximity checks. Following (2) the positions of sentence dividers should now be known.

4. Trim any worddividers characters from each of those strings, and split them on runs of consecutive worddividers.

Thus the input string has been converted to a list of lists of words.


Checking spellings

It is assumed we have access to some dictionary (Moodle provides aspell). Words in the parsed input string can be checked against that dictionary and the extradictionarywords from the options. Words that are in neither the dictionary nor extradictionarywords can be reported as misspelled words.


Expression syntax and meaning

Before evaluating a pmatch expression, all runs of consecutive white-space should be replaced by a single space. (That is preg_replace('/\s+/', ' ', $expression);) In addition, white-space is ignored either side of the '(' and ')' of 'not', 'all', 'any' and 'match*', and either side of the ',' in 'all' and 'any'.

I describe the algorithm in a top-down way. You may prefer to read the description from the bottom up.

A pmatch expression, when applied to a string, evaluates to a boolean. The expression can be made up from the following parts:

not

not( <pmatch_expression> )

is a pmatch expression that is true if <pmatch_expression> is false, and false if <pmatch_expression> is true.

all

all( <pmatch_expression1> , <pmatch_expression2> , ... )

is a pmatch expression that is true if all of <pmatch_expression1> , <pmatch_expression2> and so on are true. Accepts two or more sub-expressions.

any

any( <pmatch_expression1> , <pmatch_expression2> , ... )

is a pmatch expression that is true if any one of <pmatch_expression1> , <pmatch_expression2> and so on are true. Accepts two or more sub-expressions.

match_...

match_options( <word_sequence> )

or

match( <word_sequence> )

is true if, using the specified options, the <word_sequence> matches the string. This is defined below.

Recognised options:

allow any number of extra characters in a word in the string.
m2 
allow two misspellings of the following types when the word_with_wildcards contains 8 or more non-wildcard characters. For shorter word_with_wildcards, treat as m.
allow any one misspelling of the following types.
mx 
allow one extra character in the word in the string when matching against a word_with_wildcards, providing the word_with_wildcards contains 3 or more non-wildcard characters.
mf 
allow one fewer character in the word in the string when matching against a word_with_wildcards, providing the word_with_wildcards contains 4 or more non-wildcard characters.
mr 
allow one letter in the word in the string to differ from the equivalent letter in the word_with_wildcards, providing the word_with_wildcards contains 4 or more non-wildcard characters.
mt 
allow two neighbouring letters in the word in the string to be transposed when matching against a word_with_wildcards, providing the word_with_wildcards contains 4 or more non-wildcard characters.
allow the words in the string to be an any order. See the definition of the ' ' (space) joiner below.
allow the string to contain extra words in addition to the ones matched by the expression.
p0, p1, p2, p3 or p4 
set the proximity for the '_' joiner below. The default is 2 if no pn option is given.

Different options can be run together in any order. For example match_omfxp2w. In the case where there are no options, the _ must be omitted.

c cannot be used at the same time as any m option.

Matching word_sequences

This is the fundamental part of the pmatch algorithm.

word_sequence

A word_sequence comprises one or more word_alternatives separated by joiners.

<word_alternative1> <joiner1> <word_alternative2> <joiner2> <word_alternative3>...

The word_sequence matches the string if each word_alternative can be made to match a different word or words in the string in such a way that satisfies the constraints imposed by the joiners. In addition, if the 'w' match option is not given, all words in the string must be matched by one of the word_alternatives.

There are two possible joiners:

' ' 
(space) Both the word_alternatives must be present in the string. If the 'o' option is not given, then the last word matched by the first word_alternative must come before the first word matched by the second word_alternative.
'_' 
Both the word_alternatives must be present in the string. The last word matched by the first word_alternative must come before the first word matched by the second word_alternative (even if the 'o' option is used). There must be no more than 'proximity' other words between the last word matched by the first word_alternative and the first word matched by the second word_alternative. All the words matched by both word_alternatives must be within the same sentence.

word_alternative

A word_alternative comprises one or more word_patterns, separated by '|' characters.

<word_pattern1> '|' <word_pattern2> ...

A word alternative matches a word or words in the string if any one of the word_patterns matches that set of words.

There may be more than one way for the word_alternative to match a word or words the string. All possible matches must be considered when trying to find a way to make a word_sequence match a string.

word_pattern

A word pattern is either a

<word_with_wildcards>

or a

<bracket_expression>

bracket_expression

A bracket expression two or more simple_alternatives separated by joiners enclosed in square brackets.

'[' <simple_alternative1> <joiner1> <simple_alternative2> ... ']'


A bracket_expression matches a set of words in the string, providing each simple_alternative matches a different word in the string. The joiners mean the same as in a word_sequence with two exceptions:

  1. we are now joining simple_alternatives rather than word_alternatives, and
  2. If there is a '_' joiner before or after the word_alternative of which this bracket_expression is a part, then the ' ' joiner must match the simple_alternatives in order. That is, the 'o' option is disabled within bracket_expressions when they are part of a proximity match.

simple_alternative

A simple alternative is one or more word_with_wildcards separated by '|' characters.

<word_with_wildcards1> '|' <word_with_wildcards2> ...

A simple_alternative matches a particular word in the string providing one or more of the word_with_wildcards do.

word_with_wildcards

This is a sequence of ordinary characters with * and ? wildcards. * stands for any string of zero or more characters. ? stands for any single character.

If the ignorecase option is true, then all the characters in the word_with_wildcards are converted to lower case when the expression is parsed (using a Unicode-safe conversion function like Moodle's textlib).

Characters that have a special meaning in pmatch ('(', ')', ' ', '_', '|', '[', ']', '?', '*' and '\') can be matched as literal characters by escaping them with a \ in the word_with_wildcards.

A word_with_wildcards matches a word in the string providing there is some transformation of the word in the string according to the permitted m and c options that makes the word match the pattern.

One word_with_wildcards may match any number of words in the string. All possible matches must be considered.


Suggested PHP API

This seems like a natural design to me, based on the typical usage of testing one string against several expressions.

Note the methods for validating expressions that the teacher types into the editing form when creating questions.

/**
 * Options that control the overall way the matching is done.
 */
class pmatch_options {
    /** @var boolean */
    public $ignorecase;
 
    /** @var string of sentence divider characters. */
    public $sentencedividers = '.';
 
    /** @var string of word diveder characters. */
    public $worddividers = " \f\n\r\t";
 
    /**
     * @var array of words to recognise. These words may include sentence or
     * word divider characters.
     */
    public $extradictionarywords = array('e.g.', 'eg.', 'etc.', 'i.e.', 'ie.');
}
 
/**
 * Represents a string that is ready for matching, and provides the method to
 * match expressions against it.
 */
class pmatch_parsed_string {
    /**
     * Constructor.
     * @param string $string the string to match against.
     * @param pmatch_options $options the options to use.
     */
    public function __construct($string, pmatch_options $options);
 
    /**
     * Test this string with a given pmatch expression.
     * @param pmatch_expression $expression the expression to use.
     * @return boolean whether this string matches the expression.
     */
    public function matches(pmatch_expression $expression);
 
    /**
     * @return boolean returns true if any word is misspelt.
     */
    public function is_spelt_correctly();
 
    /**
     * @return array all the distinct misspelt words.
     */
    public function get_spelling_errors();
 
    /**
     * @return pmatch_options the options that were used to construct this object.
     */
    public function get_options();
}
 
/**
 * Represents a pmatch_expression.
 */
class pmatch_expression {
    /**
     * 
     * @param string $string the string to match against.
     * @param pmatch_options $options the options to use.
     */
    public function __construct($string, pmatch_options $options);
 
    /**
     * @return boolean returns false if the string passed to the constructor
     * could not be parsed as a valid pmatch expression.
     */
    public function is_valid();
 
    /**
     * @return string description of the syntax error in the expression string
     * if is_valid returned false. Otherwise returns an empty string.
     */
    public function get_parse_error();
 
    /**
     * @return pmatch_options the options that were used to construct this object.
     */
    public function get_options();
 
    /**
     * @return string the expression, exactly as it was passed to the constructor.
     */
    public function get_original_expression_string();
 
    /**
     * @return string a nicely formatted version of the expression.
     */
    public function get_formatted_expression_string();
}


Simple usage example

$matchoptions = new pmatch_options();
$string = new pmatch_parsed_string('Oil is lighter', $matchoptions);
if (!$string->is_spelt_correctly()) {
    // Do something.
}
$expression = new pmatch_expression('
    all(
        any(
            match_mw(great*|high|higher|more|bigger|heavier|heavy dens*|[specific gravity]|sg than_water),
            match_mw(dens*|[specific gravity]|sg great*|high|higher|more|bigger|heavier|heavy than_water),
            match_mw(dens* water < dens* oil),
        ),
        not(match_w(than_oil))
    )
', $matchoptions);
if (!$expression->is_valid()) {
    // Do something.
}
$matches = $string->matches($expression);


A small selection of unit tests

These should serve to clarify the some of the rules above. They assume the suggested API above.

Note that these are a tiny subset of all the tests I would write if I had to implement this algorithm.

Also note that there are far too many tests here that assert true, without corresponding tests that are similar but assert false.

class pmatch_test {
    protected function match($string, $expression, $options = null) {
        if (is_null($options)) {
            $options = new pmatch_options();
        }
        $string = new pmatch_parsed_string($string, $options);
        $expression = new pmatch_expression($expression, $options);
        return $string->matches($expression);
    }
 
    public function test_pmatch() {
        // Tests from the original pmatch documentation.
        $this->assertTrue('tom dick harry', 'match(tom dick harry)'); // This is the exact match.
        $this->assertTrue('thomas', 'match_c(tom)'); // Extra characters are allowed anywhere within the word.
        $this->assertTrue('tom dick and harry', 'match_w(dick)'); // Extra words are allowed anywhere within the sentence.
        $this->assertTrue('harry dick tom', 'match_o(tom dick harry)'); // Any order of words is allowed.
        $this->assertTrue('rick', 'match_m(dick)'); // One character in the word can differ.
        $this->assertTrue('rick and harry and tom', 'match_mow(tom dick harry)');
        $this->assertTrue('dick and harry and thomas', 'match_cow(tom dick harry)');
        $this->assertTrue('arthur harry and sid', 'match_mow(tom|dick|harry)'); // Any of tom or dick or harry will be matched.
        $this->assertTrue('tomy harry and sid', 'match_mow(tom|dick harry|sid)'); // The pattern requires either tom or dick AND harry or sid.
        $this->assertTrue('tom was mesmerised by maud', 'match_mow([tom maud]|[sid jane])'); // The pattern requires either (tom and maud) or (sid and jane).
        $this->assertTrue('rick', 'match(?ick)'); // The first character can be anything.
        $this->assertTrue('harold', 'match(har*)'); // Any sequence of characters can follow 'har'.
        $this->assertTrue('tom married maud sid married jane', 'match_mow(tom_maud)'); // Only one word is between tom and maud.
        $this->assertFalse('maud married tom sid married jane', 'match_mow(tom_maud)'); // The proximity control also specifies word order and over-rides the 'o' matching option.
        $this->assertFalse('tom married maud sid married jane', 'match_mow(tom_jane)'); // Only two words are allowed between tom and jane.
        $this->assertTrue('tom married maud', 'match_mow(tom|thomas marr& maud)');
        $this->assertTrue('maud marries thomas', 'match_mow(tom|thomas marr& maud)');
        $this->assertTrue('tom is to marry maud', 'match_mow(tom|thomas marr& maud)');
        $this->assertTrue('tempratur', 'match_m2ow(temperature)'); // Two characters are missing.
        $this->assertTrue('temporatur', 'match_m2ow(temperature)'); // Two characters are incorrect; one has been replaced and one is missing.
 
        // These tests come from questions Tim had to ask Phil when writing this specification.
        $this->assertTrue('cat toad frog', 'match(cat_[toad|newt frog]|dog)');
        $this->assertTrue('cat newt frog', 'match(cat_[toad|newt frog]|dog)');
        $this->assertTrue('cat dog', 'match(cat_[toad|newt frog]|dog)');
        $this->assertTrue('cat dog', 'match(cat_[toad|newt frog]|dog)');
        $this->assertTrue('x cat x x toad frog x', 'match_w(cat_[toad|newt frog]|dog)');
        $this->assertTrue('x cat newt x x x x x frog x', 'match_w(cat_[toad|newt frog]|dog)');
        $this->assertTrue('x cat x x dog x', 'match_w(cat_[toad|newt frog]|dog)');
        $this->assertFalse('A C B D', 'match([A B]_[C D])');
        $this->assertFalse('B C A D', 'match_o([A B]_[C D])');
        $this->assertTrue('A x x x x B C D', 'match_ow([A B]_[C D])');
        $this->assertFalse('B x x x x A C D', 'match_ow([A B]_[C D])'); // _ requires the words in [] to match in order.
        $this->assertFalse('A B C', 'match_ow([A B]_[B C])');
 
        // Tests of the misspelling rules.
        $this->assertTrue('test', 'match(test)');
        $this->assertFalse('tes', 'match(test)');
        $this->assertFalse('testt', 'match(test)');
        $this->assertFalse('tent', 'match(test)');
        $this->assertFalse('tets', 'match(test)');
 
        $this->assertTrue('test', 'match_mf(test)');
        $this->assertTrue('tes', 'match_mf(test)');
        $this->assertFalse('testt', 'match_mf(test)');
        $this->assertFalse('tent', 'match_mf(test)');
        $this->assertFalse('tets', 'match_mf(test)');
        $this->assertFalse('te', 'match_mf(tes)');
 
        // and so on.
    }
}