BioAuth Plugin

Note: This page is a work-in-progress. Feedback and suggested improvements are welcome. Please join the discussion on moodle.org or use the page comments.

BioAuth: A Moodle plugin for determining Quiz authorship
Project state	Community bonding period
Tracker issue	CONTRIB-4337
Discussion	XXX
Assignee	Vinnie Monaco

GSOC '13

Introduction

The purpose of the BioAuth plugin is to provide a mechanism for verifying a user's identity based on behavioral biometrics. This is accomplished by capturing keystrokes from a user and matching them against a known template for that user. The initial release of BioAuth will only support biometric authentication for essay-type questions on quizzes. Other sources of keystrokes may be added in future releases.

Installation

First, install the BioAuth local plugin. This contains everything needed to authenticate students but is not able to collect any data. Then install the quiz access rule BioLogger plugin. This allows keystroke timing information to be captured on client computers during quiz attempts.

Requirements

The BioAuth plugin requires a Moodle installation with a MySQL backend and Java (>1.5) installed on the server.

Instructions

TODO: local_bioauth installation instructions

TODO: quiz_accessrule installation instructions

Usage

The BioAuth and BioLogger plugins must both be installed for the BioAuth plugin to work properly.

Once the plugin is installed, it can be enabled for existing courses through the report page. It will automatically be enabled for newly created courses.

Important: The course must contain at least one quiz with at least one essay question. The plugin can only log keystrokes from essay questions which use the rich HTML editor.

Background

The raw data is processed in several stages in order to make an authentication decision, and either confirm or deny the identity of an individual:

Feature Extraction → Fallback Procedure → Outlier Removal → Normalization → Classification

A session is the period during which data is continuously captured from a user. This could be a quiz or some activity on a forum. A user may have multiple sessions captured while they are logged in. Each session should capture data from an independent activity or task which lasts anywhere from roughly 10 to 60 minutes, depending on the task involved.

A keystroke event is the atomic event captured by the hardware keyboard of a desktop or laptop computer. It consists of a press time, release time, and key identity. The key is identified by its location on the keyboard, not the ASCII code of the letter which was actually printed on screen. For example, 'A' is the same as 'a', even though they have different ASCII codes, and left shift is not the same as right shift, even though they have the same ASCII codes. The identity of the key can usually be determined by the key code, which should be unique for every key on the keyboard. Care must be taken to correlate key codes across different browsers and operating systems, as they may be platform dependent. For more information on events and key codes, see Quirksmode on JavaScript keys.
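A keystroke event as described above can be sketched as a small record. This is only an illustration: the class name, field names, and key names such as "left_shift" are assumptions and are not taken from the plugin's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeystrokeEvent:
    """One keystroke: which physical key, and when it went down and up."""
    key: str         # physical key name (hypothetical naming scheme)
    press_ms: int    # press timestamp in milliseconds
    release_ms: int  # release timestamp in milliseconds


# 'A' and 'a' are typed with the same physical key, so both are recorded as "a".
upper_a = KeystrokeEvent(key="a", press_ms=1000, release_ms=1090)
lower_a = KeystrokeEvent(key="a", press_ms=1200, release_ms=1275)

# Left and right shift are different physical keys, so they get different names.
left_shift = KeystrokeEvent(key="left_shift", press_ms=950, release_ms=1100)
right_shift = KeystrokeEvent(key="right_shift", press_ms=1500, release_ms=1600)
```

Identifying keys by a physical name rather than by the printed character keeps the timing data for one key together regardless of modifier state.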

A stylometry event is a portion of text that was emitted to the screen in a continuous manner. Stylometry events are separated by a change in focus or cursor position. Events that segment consecutive stylometry events (ending the current stylometry event and beginning a new one) include:

Changing the window focus
Clicking on a different area of text in the current entry
Using the arrow keys to navigate a text area
Moving to a different text area.

Using the delete key to modify text which was already printed on screen does not end the current stylometry event, although it does modify the text which is contained in the current stylometry event. Upon completion of a subtask, the sequence of stylometry events will roughly consist of the final text that the user actually typed and appeared on screen.
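The segmentation rules above can be sketched roughly as follows. The event kinds ("type", "delete", "click", and so on) are hypothetical labels chosen for this example, and the delete key is simplified to removing the last character of the current stylometry event.

```python
# Hypothetical event kinds that end the current stylometry event.
SEGMENTING_KINDS = {"focus_change", "click", "arrow_key", "field_change"}


def segment(events):
    """Split a stream of (kind, char) events into stylometry events (strings)."""
    segments, current = [], []
    for kind, char in events:
        if kind in SEGMENTING_KINDS:
            # Focus or cursor moved: close the current stylometry event.
            if current:
                segments.append("".join(current))
            current = []
        elif kind == "delete":
            # Deleting edits the current event but does not end it.
            if current:
                current.pop()
        elif kind == "type":
            current.append(char)
    if current:
        segments.append("".join(current))
    return segments


events = [
    ("type", "c"), ("type", "a"), ("type", "r"),
    ("delete", None), ("type", "t"),
    ("click", None),
    ("type", "o"), ("type", "k"),
]
result = segment(events)
```

Here the typo "car" is corrected to "cat" inside a single stylometry event, and the click starts a second one.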

Feature Extraction

A "keystroke feature" is a measurement made on the time differences between the press and release events of keystrokes. The most commonly used keystroke features are means and standard deviations of key duration and transition times.

During feature extraction, the raw data (keystroke, stylometry) is mapped to a feature space. Features usually consist of measurements on some distribution in the raw data. The hold time is the duration a key is held, or release time - press time. There are several types of transition times, the most common being release-press and press-press latencies. For example, take two consecutive keystrokes, [A,B]. The hold time of A would be (A_release - A_press), and the transition times would be (B_press - A_release) and (B_press - A_press).

The hold times of A form a distribution from which the mean and standard deviation may be taken. The same applies to the transition times between A and B, provided that the sequence [A,B] occurs frequently enough in the sample. For missing data, a fallback procedure is used, described in the next section.
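As a minimal sketch of these definitions, the following computes hold times and both transition latencies from a list of keystrokes and then takes their means and standard deviations. The tuple layout, function names, feature names, and timing values are made up for illustration.

```python
from statistics import mean, stdev

# Each keystroke is (key, press_ms, release_ms); the values are invented.
keystrokes = [
    ("a", 0, 80),
    ("b", 150, 220),
    ("a", 400, 500),
    ("b", 560, 640),
]


def hold_times(events, key):
    """Hold time = release time - press time, for every occurrence of key."""
    return [release - press for k, press, release in events if k == key]


def transition_times(events, first, second):
    """Release-press and press-press latencies for each consecutive [first, second]."""
    release_press, press_press = [], []
    for (k1, p1, r1), (k2, p2, _) in zip(events, events[1:]):
        if k1 == first and k2 == second:
            release_press.append(p2 - r1)
            press_press.append(p2 - p1)
    return release_press, press_press


a_holds = hold_times(keystrokes, "a")
rp, pp = transition_times(keystrokes, "a", "b")

features = {
    "a_hold_mean": mean(a_holds),
    "a_hold_std": stdev(a_holds),
    "ab_release_press_mean": mean(rp),
    "ab_press_press_mean": mean(pp),
}
```

Note that a release-press latency can be negative when the second key is pressed before the first is released.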

Fallback Procedure

Dealing with arbitrary text input can lead to missing observations. A fallback procedure is used to compensate: missing features may be computed from observations which are known to be correlated with the missing data.

It is known that the timing information for closely grouped keys correlates well enough to be used in place of missing data. For this reason, a fallback hierarchy based on the physiology of the touch typist is used.
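One way to picture such a hierarchy is a chain of progressively larger key groups per key, tried in order until enough observations are available. The groups below (key, then keys typed by the same finger, then a wider group of neighbouring keys) are an assumption for illustration; the actual hierarchy is defined by the plugin's feature files.

```python
from statistics import mean

# Hypothetical fallback chains, most specific group first.
FALLBACK_CHAIN = {
    "q": [["q"], ["q", "a", "z"], ["q", "a", "z", "w", "s", "x"]],
    "a": [["a"], ["q", "a", "z"], ["q", "a", "z", "w", "s", "x"]],
}


def hold_feature(key, hold_times, min_frequency):
    """Return (mean hold time, group used), falling back to broader groups."""
    for group in FALLBACK_CHAIN[key]:
        samples = [t for k in group for t in hold_times.get(k, [])]
        if len(samples) >= min_frequency:
            return mean(samples), group
    return None, None  # not enough data even in the broadest group


# Observed hold times in milliseconds; 'q' is rare in this sample.
hold_times = {"a": [90, 95, 100, 85], "q": [110], "z": [105, 95]}

q_mean, q_group = hold_feature("q", hold_times, min_frequency=3)
a_mean, a_group = hold_feature("a", hold_times, min_frequency=3)
```

In this sample "a" has enough observations to be used directly, while "q" falls back to the group of keys assumed to share a finger with it.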

Outlier Removal

Due to inconsistencies in the data provided, outliers must be removed. It is common for a user to take a break while typing, creating a large transition time between keystrokes. These can easily be removed by excluding observations which fall too far from the mean, usually more than 2 standard deviations in either direction.
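A minimal sketch of this filter, assuming a single pass and the sample standard deviation (the plugin may differ on both points):

```python
from statistics import mean, stdev


def remove_outliers(values, num_std=2.0):
    """Keep only values within num_std standard deviations of the mean."""
    if len(values) < 2:
        return list(values)  # a standard deviation needs at least two values
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) <= num_std * sigma]


# Transition times in milliseconds; the last one reflects a pause in typing.
latencies = [100, 110, 95, 105, 98, 102, 107, 99, 5000]
cleaned = remove_outliers(latencies)
```

Because a large outlier inflates both the mean and the standard deviation, a single pass may miss smaller outliers; repeating the filter until nothing more is removed is a common variation.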

Normalization

Features are normalized so that each feature is weighted equally in the KNN classification. The normalization clamps each feature between 0 and 1.
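A sketch of this kind of normalization is shown below. It assumes min-max scaling against lower and upper bounds per feature; how the plugin chooses those bounds is not stated here, so the bounds in the example are hypothetical.

```python
def normalize(value, lower, upper):
    """Scale value to [0, 1] relative to [lower, upper], clamping anything outside."""
    if upper == lower:
        return 0.0  # assumption: a feature with no spread carries no information
    scaled = (value - lower) / (upper - lower)
    return min(1.0, max(0.0, scaled))


# Hypothetical bounds of 100 ms to 200 ms for a mean hold-time feature.
inside = normalize(150, 100, 200)
above = normalize(250, 100, 200)
below = normalize(50, 100, 200)
```

Without this step, features measured on larger scales would dominate the distance computation in the KNN classifier.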

Classification

The feature space is first transformed into a feature difference space by taking the differences between every combination of feature vectors. The differences make up two classes: differences within a user and differences between users. The unknown sample (query) is compared to the feature vectors of its claimed identity. Differences between the query vector and the user's template are taken. These differences are then classified as either within-user or between-user by comparing them to the vectors that make up the difference space. A KNN algorithm is used to classify the unknown sample by looking at the closest neighbors, which came from either the same user or different users.
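The procedure can be sketched as follows. Several details are assumptions made for the example and are not confirmed by this page: element-wise absolute differences, Euclidean distance between difference vectors, and a simple majority vote over the neighbors of all query differences.

```python
import math
from itertools import combinations


def diff(u, v):
    """Element-wise absolute difference between two feature vectors."""
    return tuple(abs(a - b) for a, b in zip(u, v))


def build_difference_space(templates):
    """templates: {user: [vector, ...]} -> list of (difference vector, label)."""
    labelled = [(user, vec) for user, vecs in templates.items() for vec in vecs]
    space = []
    for (u1, v1), (u2, v2) in combinations(labelled, 2):
        label = "within" if u1 == u2 else "between"
        space.append((diff(v1, v2), label))
    return space


def authenticate(query, claimed_user, templates, k=3):
    """Accept the query if most nearest neighbors are within-user differences."""
    space = build_difference_space(templates)
    votes = {"within": 0, "between": 0}
    for template_vec in templates[claimed_user]:
        d = diff(query, template_vec)
        nearest = sorted(space, key=lambda item: math.dist(d, item[0]))[:k]
        for _, label in nearest:
            votes[label] += 1
    return votes["within"] > votes["between"]


# Normalized two-feature vectors for two users (invented values).
templates = {
    "alice": [(0.20, 0.30), (0.22, 0.28), (0.21, 0.31)],
    "bob": [(0.70, 0.80), (0.72, 0.78), (0.69, 0.82)],
}
query = (0.21, 0.29)  # a sample that resembles alice's typing

accepted_as_alice = authenticate(query, "alice", templates)
accepted_as_bob = authenticate(query, "bob", templates)
```

Working in the difference space turns a many-user identification problem into a two-class problem, so the same classifier can be used for every claimed identity.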

Settings

Mode

Currently, the plugin can either be Enabled or Disabled. Setting the mode to Enabled will activate biometric authentication for any new courses which are created. The plugin can be enabled for existing courses through the overview report page.

Weeks keep active

The number of weeks to keep a course active, after the plugin has been enabled. During this time, the plugin will continuously look for new data and re-authenticate students. The results are updated as new data is collected.

Min keystrokes per quiz

The desired number of keystrokes per quiz. This setting will only affect the time at which the plugin has determined enough data has been collected to begin authenticating students. See Percent data needed.

Percent data needed

How much data is needed before authentications should begin to take place. This only affects the initial results, which are continuously updated as new data is collected. The default setting is 50%, which would be 50% of the total data needed (students × quizzes × keystrokes-per-quiz).
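As a worked example of this arithmetic, with made-up course numbers and a hypothetical function name:

```python
def keystrokes_needed(students, quizzes, keystrokes_per_quiz, percent_needed):
    """Keystrokes required before the first authentication results are produced."""
    total = students * quizzes * keystrokes_per_quiz
    return total * percent_needed // 100


# 30 students, 4 quizzes, 500 keystrokes per quiz, default of 50%.
needed = keystrokes_needed(30, 4, 500, 50)
```

For this course the total is 60,000 keystrokes, so authentication would begin once 30,000 have been collected.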

Max concurrent jobs

The number of jobs which may be run simultaneously during a cron job.

Cache key codes

Use a cache for looking up the key codes when enrolling new data. This works very similarly to the language cache used in the String API.

Knn

The model parameter K, where K is the number of neighbors to use when making an authentication decision.

Min key frequency

The minimum number of times a keystroke must occur for its feature to be computed directly. When there are fewer occurrences, the fallback procedure is used. The fallback procedure is defined in the features file.

Decision mode

How the classifier should behave when making an authentication decision. In reality, there will always be at least a small error in the accuracy of the classifier. This setting will determine whether a false acceptance or false rejection should be weighted more heavily. A more secure system would reduce the number of false acceptances at the cost of increasing the number of false rejections. The tradeoff can be seen on the ROC curve in the quiz report.

Feature set

Which feature set to use. The features are defined in the feature files and loaded during installation.

Report

Overview

The overview report gives a summary of each course. The plugin can be enabled or disabled for any particular course by clicking the appropriate action button.

Quiz

The quiz report displays the authentication results for every student in the course. In each cell, a student can either be authenticated or rejected. A '-' will appear in cells where not enough data has been collected to make a decision.

The graph at the top of the page displays the tradeoff between the false acceptance and false rejection rate. This is the receiver operating characteristic (ROC) curve. It can be used to determine the overall performance of the classifier for the given population of students.

Multi-language support

The BioAuth plugin was designed to be easily extensible to other languages and locales. This can be done by defining the physical keys which may be pressed on the keyboard and creating the keystroke features. Defining the keys themselves is straightforward and only requires a mapping of the different agent/locale key codes to the name of the key. Creating the keystroke features requires some knowledge of the target language and application in order to create features which may easily identify an individual. Usually, keys which occur with a high frequency are good to use as features, as are transitions between commonly occurring keys.

Keys

Keycodes for various languages are defined in the keys/[language]/keycode.php file. The system will detect the current language and use the appropriate keycode file to translate the keycodes. This is needed because the keycodes vary across different browser agents and platforms. The different codes must all be mapped to the physical key that they were generated by.
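The idea of mapping agent-specific key codes to one physical key can be sketched as a lookup table. The structure and names below are hypothetical and do not reflect the actual format of the keycode.php files; the codes shown (59 and 186 for the semicolon key) are examples of values that have historically differed between browsers.

```python
# Hypothetical mapping: (browser agent, raw key code) -> physical key name.
KEYCODE_MAP = {
    ("firefox", 59): "semicolon",
    ("chrome", 186): "semicolon",
    ("firefox", 65): "a",
    ("chrome", 65): "a",
}


def physical_key(agent, keycode):
    """Translate a raw key code from a given agent into a physical key name."""
    return KEYCODE_MAP.get((agent, keycode), "unknown")


firefox_key = physical_key("firefox", 59)
chrome_key = physical_key("chrome", 186)
```

Both lookups resolve to the same physical key, so timing data from different browsers can be pooled into the same feature.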

Features

Features are defined in the features/ folder. Feature sets are not tied to any particular language because the same feature set may be used for multiple languages or there may be several feature sets for one language. The installation comes with a feature set designed for standard QWERTY keyboards and uses a touch-type hierarchy fallback procedure.

Schedule

The work will be completed under the following schedule:

Dates	Task	Status
May 27 - June 17	Draft of design, to be accomplished during Community Bonding Period. Also, deadline for submitting a paper to MRC2013.	Complete
June 17 - July 1	Binary classifier and authentication decision functions	Complete
July 1 - July 15	Key logger DB model	Complete
July 15 - August 1	Keystroke Feature extractor	Complete
August 1 - August 11	UI for results and settings	Complete
August 15 - Sept. 1	Beta testing with trusted volunteers. Bug fixes and final report.	Complete
Sept. 1 - Sept. 23	Prepare for release. Scrub code, tests, documentation. Live testing during Fall 2013 academic semester	Complete

Documentation