Measurement Decision Theory
Developed by Wald (1947), first applied to measurement by Cronbach and Gleser (1957), and now widely used in engineering, agriculture, and computing, decision theory provides a simple model for the analysis of categorical data. It is most applicable in measurement when the goal is to classify examinees into one of two categories, e.g. pass/fail or master/non-master.
From pilot testing, one estimates
After the test is administered, one can compute (based on the examinee's responses and the pilot data):
Classical measurement theory and item response theory are concerned primarily with rank ordering examinees across an ability continuum. Those models are concerned, for example, with differentiating examinees at the 90th and 92nd percentiles. But one is often interested in classifying examinees into one of a finite number of discrete categories, such as pass/fail or proficient/basic/below-basic. This is a simpler outcome and a simpler measurement model should suffice. Measurement Decision Theory is one such simpler tool.
Measurement decision theory requires only one key assumption - that the items are independent. Thus, the tested domain does not need to be unidimensional, examinee ability does not need to be normally distributed, and one doesn’t need to be concerned with the fit of the data to a theoretical model as in item response theory (IRT) or in most latent class models. The model is attractive as the routing mechanism for intelligent tutoring systems, for end-of-unit examinations, for adaptive testing, and as a means of quickly obtaining the classification proportions on other examinations. Very few pilot test examinees are needed and, with very few items, classification accuracy can exceed that of item response theory. Given these attractive features, it is surprising that the model has not attracted wider attention within the measurement community.
Isolated elements of decision theory have appeared sporadically in the measurement literature. Key articles in the mastery testing literature of the 1970s employed decision theory (Hambleton and Novick, 1973; Huynh, 1976; van der Linden and Mellenbergh, 1977) and should be re-examined in light of today’s measurement problems. Lewis and Sheehan (1990) and others used decision theory to adaptively select items. Kingsbury and Weiss (1983), Reckase (1983), and Spray and Reckase (1996) have used decision theory to determine when to stop testing. Most of the research to date has applied decision theory to testlets or test batteries or as a supplement to item response theory and specific latent class models. Notable articles by Macready and Dayton (1992), Vos (1997), and Welch and Frick (1993) illustrate the less prevalent item-level application of decision theory examined in this tutorial.
Overview and notation
The objective is to form a best guess as to the mastery state (classification) of an individual examinee based on the examinee’s item responses, a priori item information, and a priori population classification proportions. Thus, the model has four components: 1) possible mastery states for an examinee, 2) calibrated items, 3) an individual’s response pattern, and 4) decisions that may be formed about the examinee.
There are K possible mastery states that take on values mk. In the case of pass/fail testing, there are two possible states and K=2. One usually knows, a priori, the approximate proportions for the population of all examinees in each mastery state.
The second component is a set of items for which the probability of each possible observation, usually right or wrong, given each mastery state is also known a priori.
The responses to a set of N items form the third component. Each item is considered to be a discrete random variable stochastically related to the mastery states and realized by observed values z1, z2, ... zN. Each examinee has a response vector, z, composed of z1, z2, ... zN. Only dichotomously scored items are considered in this article.
The last component is the decision space. One can form any number of D decisions based on the data. Typically, one wants to guess the mastery state and there will be D=K decisions. With adaptive or sequential testing, a decision to continue testing will be added and thus there will be D=K+1 decisions. Each decision will be denoted dk.
Testing starts with the proportion of examinees in the population that are in each of the K categories and the proportion of examinees within each category that respond correctly. The population proportions can be determined in a variety of ways, including from prior testing, transformations of existing scores, existing classifications, and judgement. In the absence of information, equal priors can be assumed. The proportions that respond correctly can be derived from a small pilot test involving examinees that have already been classified or from transformations of existing data. Once these sets of priors are available, the items are administered, responses (z1, z2, ... zN) observed, and then a classification decision, dk, is made based on the responses to those items.
Proportions from the pilot test are treated as probabilities and the following notation is used:
An estimate of an examinee’s mastery state is formed using the priors and observations. By Bayes’ theorem,
$$P(m_k \mid \mathbf{z}) = c \, P(\mathbf{z} \mid m_k) \, P(m_k) \tag{1}$$
The posterior probability P(mk|z) that the examinee is of mastery state mk given his response vector is equal to the product of a normalizing constant (c), the probability of the response vector given mk, and the prior classification probability. For each examinee, there are K probabilities, one for each mastery state. The normalizing constant in (1),
$$c = \frac{1}{\sum_{k=1}^{K} P(\mathbf{z} \mid m_k) \, P(m_k)}$$
ensures that the sum of the posterior probabilities equals 1.0.
Assuming local independence,
$$P(\mathbf{z} \mid m_k) = \prod_{i=1}^{N} P(z_i \mid m_k) \tag{2}$$
That is, the probability of the response vector is equal to the product of the conditional probabilities of the item responses. In this tutorial, each response is either right (1) or wrong (0) and P(z1=0|mk) = 1- P(z1=1|mk).
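The posterior computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `posterior`, its argument layout, and the numeric values are all hypothetical and chosen only to show the mechanics of (1) and (2).

```python
from math import prod

def posterior(priors, p_correct, responses):
    """Return P(m_k | z) for every mastery state k.

    priors[k]       : prior probability P(m_k)
    p_correct[k][i] : P(z_i = 1 | m_k), taken from pilot data
    responses[i]    : observed response to item i (1 = right, 0 = wrong)
    """
    # Equation (2): under local independence the likelihood of the
    # response vector is the product of the item-level probabilities.
    likelihoods = [
        prod(p if z == 1 else 1.0 - p for p, z in zip(row, responses))
        for row in p_correct
    ]
    # Equation (1): likelihood times prior, rescaled by the constant c.
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    c = 1.0 / sum(joint)
    return [c * j for j in joint]

# Hypothetical values for illustration only (two states, three items).
priors = [0.5, 0.5]
p_correct = [[0.9, 0.8, 0.7], [0.4, 0.3, 0.2]]
post = posterior(priors, p_correct, [1, 0, 1])
print([round(p, 3) for p in post])
```

Because the result is normalized, the returned values always sum to 1.0 regardless of how many mastery states are supplied.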
Three key concepts from decision theory are discussed next:
The model is illustrated here with an examination of two possible mastery states m1 and m2 and two possible decisions d1 and d2 which are the correct decisions for m1 and m2, respectively. The examples use a three item test with the item statistics shown in Table 1. Further, also based on pilot test data, the prior classification probabilities are P(m1)=0.2 and P(m2)=1-P(m1) = 0.8. In the example, the examinee’s response vector is [1,1,0].
The task is to make a best guess as to an examinee’s classification (master, non-master) based on the data in Table 1 and the examinee’s response vector. From (2), the probability of the vector z= [1,1,0] is .6*.8*.4 = .19 if the examinee is a master, and .09 if he is a non-master. That is, P(z|m1)=.19 and P(z|m2)=.09. Normalized, P(z|m1)=.68 and P(z|m2)=.32.
A sufficient statistic for decision making is the likelihood ratio
$$L(\mathbf{z}) = \frac{P(\mathbf{z} \mid m_2)}{P(\mathbf{z} \mid m_1)}$$
which for the example is L(z)= .09/.19 = .47. This is a sufficient statistic because all decision rules can be viewed as a test comparing L(z) against a criterion value λ.
The value of λ reflects the selected approaches and judgements concerning the relative importance of different types of classification error.
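The worked example can be checked with a short Python sketch. Several things here are assumptions rather than the author's stated procedure: the item probabilities are inferred from the arithmetic in the text (Table 1 is not reproduced in this copy), the rule "choose d2 when L(z) exceeds the criterion" is one conventional orientation of the test, and the two criterion values shown (1 for maximum likelihood, P(m1)/P(m2) for MAP) are the standard textbook forms.

```python
from math import prod

# P(z_i = 1 | m_k), inferred from the worked arithmetic in the text
# (assumed values; Table 1 is not available in this copy).
P_CORRECT = {"m1": [0.6, 0.8, 0.6], "m2": [0.3, 0.6, 0.5]}
PRIORS = {"m1": 0.2, "m2": 0.8}

def likelihood(state, z):
    """P(z | state) under local independence."""
    return prod(p if r == 1 else 1.0 - p for p, r in zip(P_CORRECT[state], z))

def likelihood_ratio(z):
    """L(z) = P(z | m2) / P(z | m1)."""
    return likelihood("m2", z) / likelihood("m1", z)

def decide(z, criterion):
    """Assumed rule: choose d2 when L(z) exceeds the criterion, else d1."""
    return "d2" if likelihood_ratio(z) > criterion else "d1"

z = [1, 1, 0]
L = likelihood_ratio(z)
ml_decision = decide(z, 1.0)                             # maximum likelihood
map_decision = decide(z, PRIORS["m1"] / PRIORS["m2"])    # MAP, priors included
print(round(L, 2), ml_decision, map_decision)  # 0.47 d1 d2
```

The two rules disagree for this response vector: the likelihoods alone favor master, while the strong prior toward non-master (0.8) tips the MAP decision the other way.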
Maximum-likelihood decision criterion
Minimum probability of error decision criterion
Maximum a posteriori (MAP) decision criterion
Bayes Risk Criterion
Rather than make a classification decision for an individual after administering a fixed number of items, it is possible to sequentially select items to maximize information, update the estimated mastery state classification probabilities and then evaluate whether there is enough information to terminate testing. In measurement this is frequently called adaptive or tailored testing. In statistics, this is called sequential testing.
At each step, the posterior classification probabilities P(mk|z) are treated as updated prior probabilities P(mk) and used to help identify the next item to be administered. To illustrate decision theory sequential testing, again consider the situation for which there are two possible mastery states m1 and m2 and use the item statistics in Table 1. Assume the examinee responded correctly to the first item and the task is to select which of the two remaining items to administer next.
After responding correctly to the first item, the current updated probability of being a master is .6*.2/(.6*.2+.3*.8) = .33 and the probability of being a non-master is .66 from formula (1).
The current probability of responding correctly is
$$P(z_i = 1) = \sum_{k=1}^{K} P(z_i = 1 \mid m_k) \, P(m_k) \tag{5}$$
Applying (5), the current probability of correctly responding to item 2 is P(z2=1)=.8*.33+ .6*.66 = .66 and, for item 3, P(z3=1)=.53. The following are some approaches to identify which of these two items to administer next.
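The update after the first item and the two predictive probabilities can be reproduced with the following Python sketch. The item probabilities are inferred from the arithmetic in the text (an assumption, since Table 1 is not reproduced in this copy), and the helper names `update` and `p_correct_now` are hypothetical.

```python
# Rows: m1 (master), m2 (non-master); columns: items 1 to 3.
# Values are inferred from the worked arithmetic in the text (assumed).
P_CORRECT = [[0.6, 0.8, 0.6], [0.3, 0.6, 0.5]]
PRIORS = [0.2, 0.8]

def update(priors, item, response):
    """Apply formula (1) for a single item; the result becomes the new prior."""
    joint = [
        prior * (row[item] if response == 1 else 1.0 - row[item])
        for prior, row in zip(priors, P_CORRECT)
    ]
    total = sum(joint)
    return [j / total for j in joint]

def p_correct_now(current, item):
    """Formula (5): probability of a correct response given current beliefs."""
    return sum(p * row[item] for p, row in zip(current, P_CORRECT))

current = update(PRIORS, item=0, response=1)   # correct answer to item 1
p_item2 = p_correct_now(current, 1)
p_item3 = p_correct_now(current, 2)
print([round(p, 2) for p in current], round(p_item2, 2), round(p_item3, 2))
```

The unrounded posteriors are 1/3 and 2/3, so the .66 values quoted in the text reflect truncation rather than rounding; the predictive probability for item 3 comes out at about .53 either way.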
Minimum expected cost
This article has discussed procedures for making a classification decision and procedures for selecting the next items to be administered sequentially. This section presents procedures for deciding when one has enough information to hazard a classification guess. One could make this determination after each response.
Perhaps the simplest rule is the Neyman-Pearson decision criterion: continue testing until the probability of a false negative, P(d2|m1), is less than a preselected value α. Suppose α = .05 was selected. After the first item, the probability of being a non-master is P(m2|z) = .66. If the examinee is declared a non-master, then the current probability of this being a false negative is the probability that the examinee is in fact a master, P(m1|z) = .33. Because this is more than α, the decision is to continue testing.
A variant of Neyman-Pearson is the fixed error rate criterion: establish two thresholds, α1 and α2, and continue testing until P(d2|m1) < α1 and P(d1|m2) < α2. Another variant is the cost threshold criterion. Under that approach, costs are assigned to each correct and incorrect decision and to the decision to take another observation. Testing continues until the cost threshold is reached. A variant on that approach is to change the cost structure as the number of administered items increases.
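The stopping check in this example can be sketched in Python as follows. It is an illustration under assumptions: the item probabilities are inferred from the arithmetic in the text, the function names are hypothetical, and the false-negative risk of declaring a non-master is taken to be the current posterior probability of mastery.

```python
# Rows: m1 (master), m2 (non-master). Values inferred from the text (assumed).
P_CORRECT = [[0.6, 0.8, 0.6], [0.3, 0.6, 0.5]]
PRIORS = [0.2, 0.8]
ALPHA = 0.05  # preselected tolerance for a false negative

def p_master_after(responses):
    """Posterior probability of mastery after the items answered so far."""
    joint = list(PRIORS)
    for i, r in enumerate(responses):
        joint = [
            j * (row[i] if r == 1 else 1.0 - row[i])
            for j, row in zip(joint, P_CORRECT)
        ]
    return joint[0] / sum(joint)

def can_declare_non_master(responses, alpha=ALPHA):
    """Stop and declare a non-master only if the false-negative risk is below alpha."""
    return p_master_after(responses) < alpha

risk = p_master_after([1])  # one item administered, answered correctly
print(round(risk, 2), can_declare_non_master([1]))  # 0.33 False
```

With a risk of about .33 against a tolerance of .05, the rule calls for another item, which matches the conclusion in the text.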
Wald’s (1947) sequential probability ratio test (SPRT, pronounced spurt) is the best-known sequential decision rule. SPRT for K categories can be summarized as
where the P(mj)’s are the normalized posterior probabilities, α is the acceptable error rate, and β is the desired power. If the condition is not met for any category k, then testing continues. In the measurement field, there is a sizeable and impressive body of literature illustrating that SPRT is very effective as a termination rule for IRT based computer adaptive tests (cf. Reckase, 1983; Spray and Reckase, 1994, 1996; Lewis and Sheehan, 1990; Sheehan and Lewis, 1992).
In their introduction, Cronbach and Gleser (1957) argue that the ultimate purpose for testing is to arrive at qualitative classification decisions. Today’s decisions are often binary, e.g. whether to hire someone, whether a person has mastered a particular set of skills, whether to promote an individual. Multi-state conditions are common in state assessments, e.g. the percent of students that perform at the basic, proficient or advanced level. The simple measurement model presented in this article is applicable to these and other situations where one is interested in categorical information.
The model has a very simple framework - one starts with the conditional probabilities of examinees in each mastery state responding correctly to each item. One can obtain these probabilities from a very small pilot sample. This research demonstrated that a minimum cell size of one examinee per item is a reasonable calibration sample size. The accuracies of tests calibrated with such a small sample size are extremely close to the accuracies of tests calibrated with hundreds of examinees per cell.
An individual’s response pattern is evaluated against these conditional probabilities. One computes the probabilities of the response vector given each mastery level. Using Bayes’ theorem, the conditional probabilities can be converted to a posteriori probabilities representing the likelihood of each mastery state. Alternative decision rules were presented.
This article examined two ways to adaptively, or sequentially, administer items using the model: the traditional decision theory sequential testing approach, minimum cost, and a new approach, information gain, which is based on entropy and comes from information theory.
Research has shown that very few pilot test examinees are needed to calibrate the system (Rudner, in press). One or two examinees per cell per item result in a test that is as accurate as one calibrated with hundreds of pilot test examinees per cell. The results were consistent across item pools and test lengths. The essential data from the pilot are the proportions of examinees within each mastery state that respond correctly. One does not truly need a priori probabilities of a randomly chosen examinee being in each mastery state. Uniform priors can be expected to increase the number of needed items and not seriously affect accuracy given properly chosen stopping rules.
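The information-gain idea can be sketched in Python as the expected reduction in Shannon entropy of the classification probabilities. This is an assumed formulation, not necessarily the author's exact procedure: the item probabilities are inferred from the worked example, the starting beliefs are the posteriors after a correct response to item 1, and all function names are hypothetical.

```python
from math import log2

# Rows: m1 (master), m2 (non-master). Values inferred from the text (assumed).
P_CORRECT = [[0.6, 0.8, 0.6], [0.3, 0.6, 0.5]]
current = [1 / 3, 2 / 3]  # beliefs after a correct response to item 1

def entropy(probs):
    """Shannon entropy, in bits, of a set of classification probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def updated(probs, item, response):
    """Posterior classification probabilities after one more response."""
    joint = [
        p * (row[item] if response == 1 else 1.0 - row[item])
        for p, row in zip(probs, P_CORRECT)
    ]
    total = sum(joint)
    return [j / total for j in joint]

def expected_entropy(probs, item):
    """Entropy after the item, averaged over right and wrong responses."""
    p_right = sum(p * row[item] for p, row in zip(probs, P_CORRECT))
    return (p_right * entropy(updated(probs, item, 1))
            + (1.0 - p_right) * entropy(updated(probs, item, 0)))

# Candidate items are the two not yet administered (indices 1 and 2).
gains = {i: entropy(current) - expected_entropy(current, i) for i in (1, 2)}
best_item = max(gains, key=gains.get)
print({i + 1: round(g, 3) for i, g in gains.items()}, "administer item", best_item + 1)
```

Under these assumed values, item 2 offers the larger expected reduction in entropy and would be administered next.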
This is clearly a simple yet powerful and widely applicable model. The advantages of this model are many --the model
It is the author’s hope that this research will capture the imagination of the research and applied measurement communities. The author can envision wider use of the model as the routing mechanism for intelligent tutoring systems. Items could be piloted with a small number of examinees to vastly improve end-of-unit examinations. Certification examinations could be created for specialized occupations with a limited number of practitioners available for item calibration. Short tests could be prepared for teachers to help make tentative placement and advancement decisions. A small collection of items from one test, say state-NAEP, could be embedded in another test, say a state assessment, to yield meaningful cross-regional information.
The research questions are numerous. How can the model be extended to multiple rather than dichotomous item response categories? How can bias be detected? How effective are alternative adaptive testing and sequential decision rules? Can the model be effectively extended to 30 or more categories and provide a rank ordering of examinees? How can we make good use of the fact that the data is ordinal? How can the concept of entropy be employed in the examination of tests? Are there new item analysis procedures that can improve measurement decision theory tests? How can the model be best applied to criterion referenced tests assessing multiple skills, each with a small number of items? Why are minimum cost and information gain so similar? How can different cost structures be effectively employed? How can items from one test be used in another? How does one equate such tests? The author is currently investigating the applicability of the model to computer scoring of essays. In that research, essay features from a large pilot are treated as items and holistic scores as the mastery states.
This tutorial was developed with funds from the National Library of Education, U.S. Department of Education, award xxx and from the National Institute for Student Achievement, Curriculum and Assessment, U.S. Department of Education, grant award R305T010130. The views and opinions expressed in this article are those of the author and do not necessarily reflect those of the funding agency.