Item analysis theoretical background

I am trying to understand what the state of the art is in item analysis. I am struggling to find my way into the literature, but have started this page to record my progress. So far, I have found the following references after several hours in the OU library around classmark 371.26. --Tim Hunt 05:58, 30 November 2007 (CST)

References I have got hold of

J J Barnard, Item Analysis in Test Construction, pp. 195-206, in Geoffrey N Masters & John P Keeves 1999, Advances in Measurement in Educational Research and Assessment, Pergamon.

Mentions two topics "Classical Test Theory" and "Item Response Theory".
"Item analysis is not a substitute for the originality, effort and skill of the item writer and relatively poor statistical results can be overruled on logical grounds."
"The two most basic statistics computed and examined during item analysis are the items' difficulty and [discrimination] values."
The difficulty is basically the average score for the item. The higher the average score, the easier the item is.
When computing this average, you have to decide whether to ignore students who did not submit an answer, or to include them as zero score. And you have to consider that in a timed test, questions near the end are more likely to be missed.
For discrimination, there are different techniques for questions with and without partial scores.
For questions that are scored 0/1 (dichotomously scored) point biserial correlation is most commonly used.
You should allow for the fact that the score for this item is included in the score for the whole test. However, for tests with many questions, the correction is small.
Item reliability index (Gulliksen's product): r_it S_i, that is, the item-total correlation multiplied by the item's standard deviation.
The above is Classical Test Theory.
Item Response Theory is based on more computer-intensive techniques, involving fitting models to the data (maximum likelihood estimation).
"It can be concluded that CTT and IRT should not be viewed as rival theoretical frameworks. A duet, rather than a duel between CTT and IRT will provide most information to the test developer. The results obtained from a CTT based item analysis can yield useful information in finding flaws in items and guiding the test developer towards choosing an appropriate IRT model. The advantages that IRT parameters offer should subsequently be used for constructing tests for specific purposes, ..."
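
The Classical Test Theory statistics in the notes above can be sketched in a few lines of Python. This is only a minimal illustration, not how Moodle or any of the references compute them. The function names, the use of None to mark an omitted answer, and the sample data are all assumptions made for the example. Discrimination is computed as a plain Pearson correlation, which for 0/1 items is the same thing as the point biserial correlation.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pstdev(xs):
    # Population standard deviation (divides by n, not n - 1).
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def pearson(xs, ys):
    # Returns None when either variable has no variance.
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return None
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)

def difficulty(item_scores, omitted_as_zero=True):
    # Average score on one item. None marks a student who did not answer
    # (an assumption of this sketch). Either count them as zero or drop them.
    if omitted_as_zero:
        scores = [0 if s is None else s for s in item_scores]
    else:
        scores = [s for s in item_scores if s is not None]
    return mean(scores)

def discrimination(responses, item, corrected=True):
    # responses: one row of item scores per student, omissions already set to 0.
    # With corrected=True the item's own score is removed from the test total
    # before correlating, so the item is not correlated with itself.
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    if corrected:
        totals = [t - s for t, s in zip(totals, item_scores)]
    return pearson(item_scores, totals)

def item_reliability_index(responses, item):
    # Gulliksen's product: item-total correlation times item standard deviation.
    r_it = discrimination(responses, item, corrected=False)
    s_i = pstdev([row[item] for row in responses])
    return None if r_it is None else r_it * s_i

# Made-up data: four students, three dichotomously scored items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(difficulty([1, 0, None, 1], omitted_as_zero=True))
print(difficulty([1, 0, None, 1], omitted_as_zero=False))
print(discrimination(responses, 0, corrected=False))
print(discrimination(responses, 0, corrected=True))
print(item_reliability_index(responses, 0))
```

With only three items the correction makes a noticeable difference to the discrimination value. As noted above, the effect shrinks as the number of questions grows.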

R L Ebel 1972, Essentials of Educational Measurement, Prentice Hall.

This is very good on interpreting the numbers, and on wider issues of test construction, but does not consider more advanced mathematical techniques.
There was a new edition of this book by Ebel and Frisbie in 1991.

William A Mehrens & Irvin J Lehmann 1973, Measurement and Evaluation in Education and Psychology, Holt Rinehart and Winston Inc.

R L Thorndike 1971, Educational Measurement, American Council on Education.

Repeats the point that items at the end of a timed test are omitted by a lot of students, which skews the statistics.

References to try to get

These last two above have probably both been superseded by:

R L Thorndike 2004, Measurement and Evaluation in Psychology and Education (Seventh edition), Prentice Hall.

This looks like it might be worth getting (previous edition cited by J J Barnard):

L Crocker & J Algina 2006, Introduction to Classical and Modern Test Theory, Wadsworth Pub Co.

Other points

Another book mentioned that sometimes you want to, for example, analyse test data by group (e.g. male/female) to look for possible bias against one group (discrimination in the everyday sense, not the item statistic above).
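
As a rough illustration of analysing by group, the sketch below computes the average score on one item separately for each group. The function name and the group labels are hypothetical. A gap between groups is only a prompt to look more closely at the item, because the groups may also differ in overall ability.

```python
from collections import defaultdict

def difficulty_by_group(item_scores, groups):
    # item_scores[k] is student k's score on the item,
    # groups[k] is that student's group label.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, group in zip(item_scores, groups):
        sums[group] += score
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}

# Made-up data: two students in each group.
print(difficulty_by_group([1, 0, 1, 1], ["f", "f", "m", "m"]))
```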

There is the idea that you can look at the reliability of a test by randomly splitting the class in half, and comparing the statistics for the two halves.
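
A minimal sketch of that idea, assuming the responses are held as one row of item scores per student: shuffle the class, cut it in half, and compute each item's average score in each half. The function name and the fixed seed are assumptions for the example. It needs at least two students.

```python
import random

def split_class_difficulties(responses, seed=0):
    # responses: one row of item scores per student.
    # Returns two lists, one per half, of per-item average scores.
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rows = list(responses)
    rng.shuffle(rows)
    mid = len(rows) // 2
    halves = (rows[:mid], rows[mid:])
    n_items = len(rows[0])
    return [
        [sum(row[i] for row in half) / len(half) for i in range(n_items)]
        for half in halves
    ]

# Made-up data: six students, two items.
responses = [[1, 0], [1, 1], [0, 0], [1, 0], [1, 1], [0, 1]]
first_half, second_half = split_class_difficulties(responses)
print(first_half, second_half)
```

If the per-item figures for the two halves are far apart, the statistics are probably not stable enough to act on.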

What you really want to do is compare item scores to the property you are trying to measure in the test (for example, the student's mathematical ability), as opposed to their score on the test as a whole. However, you don't have any measure of the property you are really interested in. The overall test score is the best (only) estimate you have of that.

The age of the references I have read so far means that they cannot assume the processing power of modern computers. Therefore, the procedures they describe are unnecessarily simplified.

Difficulties

What about repeat attempts at a particular quiz by the same student? What does this do to the analysis?

What about adaptive mode?

Conclusions

It is probably sufficient for Moodle to offer teachers a basic form of item analysis. This will catch obviously defective assessment items.

We should probably not try to implement very sophisticated item analysis schemes. They are open to misapplication, which is more of a drawback than the extra power they provide when used correctly.
