An On-line, Interactive, Computer Adaptive Testing Tutorial, 11/98 (http://EdRes.org/scripts/cat)
Lawrence M. Rudner
Welcome to our online computer adaptive testing (CAT) tutorial. Here you will have the opportunity to learn the logic of CAT and see the calculations that go on behind the scenes. You can play with an actual CAT. We provide the items and the correct answers. You can try different scenarios and see what happens. You can pretend you are a high ability, average or low ability examinee. You can intentionally miss easy items. You can get items right that should be very hard for you.
This tutorial assumes some background in statistics or measurement. However, I hope even the novice will be able to follow along.
Once you are familiar with the software, try taking the test under the different scenarios described above.
Are you ready to try an actual computer adaptive test? You will first be asked to pick a true score and the true score scale (z, SAT, or percentiles; you can change the scale at any time.) The CAT will then start with the first item. To help you see what is going on behind the scenes, graphs of Information Functions, the Item Response Function for the selected item, and standard error will be presented. If any of the graphs are not clear, push the Explain button and detailed, tailored information will appear. The information presented by the Explain button varies as the CAT progresses, so you may want to push the button several times. After you have responded to about 10 items, you may want to push the Done button. Your item history will be presented. If you respond to more than 5 items, summary graphs will also be presented. Have fun. If you have suggested activities or suggested improvements, please let me (Lawrence M. Rudner, LRudner@edres.org) know.
What and Why of Computer Adaptive Testing
When an examinee is administered a test via the computer, the computer can update the estimate of the examinee's ability after each item and then that ability estimate can be used in the selection of subsequent items. With the right item bank and a high examinee ability variance, CAT can be much more efficient than a traditional paper-and-pencil test.
Paper-and-pencil tests are typically "fixed-item" tests in which the examinees answer the same questions within a given test booklet. Since everyone takes every item, all examinees are administered items that are either very easy or very difficult for them. These easy and hard items are like constants added to someone's score. They provide relatively little information about the examinee's ability level. Consequently, large numbers of items and examinees are needed to obtain a modest degree of precision.
With computer adaptive tests, the examinee's ability level relative to a norm group can be iteratively estimated during the testing process and items can be selected based on the current ability estimate. Examinees can be given the items that maximize the information (within constraints) about their ability levels from the item responses. Thus, examinees will receive few items that are very easy or very hard for them. This tailored item selection can result in reduced standard errors and greater precision with only a handful of properly selected items.
This tutorial employs the widely accepted three-parameter item response theory model first described by Birnbaum (1968). Under the 3-parameter IRT model, the probability of a correct response to a given item i is a function of an examinee's true ability and three item parameters: the discrimination a_{i}, the difficulty b_{i}, and the lower asymptote (pseudo-guessing) parameter c_{i}.
Each item i has a different set of these three parameters. These parameters are usually calculated based on prior administrations of the item.
The model states that the probability of a correct response to item i for examinee j is a function of the three item parameters and examinee j's true ability θ_{j}:

P_{i}(θ_{j}) = c_{i} + (1 - c_{i}) / (1 + e^{-a_{i}(θ_{j} - b_{i})})

This function is plotted below with a_{i} = 2.0, b_{i} = 0.0, c_{i} = .25, and θ varying from -3.0 to 3.0.
The horizontal axis is the ability scale, ranging from very low (-3.0) to very high (+3.0). When ability follows the normal curve, 68% of the examinees will have an ability between -1.0 and +1.0; 95% will be between -2.0 and +2.0. The vertical axis is the probability of responding correctly to this item (defined by the three item parameters) at each value of θ.
The lower asymptote is at c_{i} = .25. This is the probability of a correct response for examinees with very little ability (e.g. θ = -2.0 or -2.6). The curve has an upper asymptote at 1.0; high ability examinees are very likely to respond correctly.
The b_{i} parameter defines the location of the curve's inflection point along the theta scale. Lower values of b_{i} will shift the curve to the left; higher values to the right. The b_{i} parameter does not affect the shape of the curve.
The a_{i} parameter defines the slope of the curve at its inflection point. The curve would be flatter with a lower value of a_{i}; steeper with a higher value. Note that when the curve is steep, there is a large difference between the probabilities of a correct response for a) examinees whose ability is slightly below (left) of the inflection point and b) examinees whose ability is slightly above the inflection point. Thus a_{i} denotes how well the item is able to discriminate between examinees of slightly different ability (within a narrow effective range).
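To make the roles of the three parameters concrete, here is a short Python sketch of the 3PL response function described above; the specific theta values printed are chosen purely for illustration:

```python
import math

def p_3pl(theta, a, b, c):
    """3PL model: probability of a correct response at ability theta.
    c is the lower asymptote; the upper asymptote is 1.0."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# The item plotted above: a = 2.0, b = 0.0, c = 0.25
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta = {theta:+.1f}   P = {p_3pl(theta, 2.0, 0.0, 0.25):.3f}")
```

Note that at theta = b the probability is midway between c and 1.0 (here .625), and at very low theta it approaches the lower asymptote of .25.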
One of the positive features of IRT is that distributional assumptions about ability do not need to be made and the scaling of ability is arbitrary. It is common, and convenient, to place ability scores on a scale with a mean of zero and a standard deviation of one. When this scaling is done, ability scores mostly fall between -3.0 and +3.0; in other words, regardless of the shape of the ability score distribution, it is common to find nearly all of the scores within three standard deviations of the mean ability score.
If you are new to Item Response Theory, I encourage you to play with varying the item parameter values for this function at http://edres.org/scripts/cat/genicc.htm.
Computer adaptive testing can begin when an item bank exists with IRT item statistics available on all items, when a procedure has been selected for obtaining ability estimates based upon candidate item performance, and when there is an algorithm chosen for sequencing the set of test items to be administered to candidates.
The CAT algorithm is usually an iterative process with the following steps:
1. All the items that have not yet been administered are evaluated to determine which will be the best one to administer next, given the current ability estimate.
2. The "best" next item is administered and the examinee responds.
3. A new ability estimate is computed based on the responses to all of the administered items.
Steps 1 through 3 are repeated until a stopping criterion is met.
Several different methods can be used to compute the statistics needed in each of these three steps. Hambleton, Swaminathan, and Rogers (1991); Lord (1980); Wainer, Dorans, Flaugher, Green, Mislevy, Steinberg, and Thissen (1990); and others have shown how this can be accomplished using Item Response Theory.
Treating item parameters as givens, the ability estimate is the value of theta that best fits the model. When the examinee is given a sufficient number of items, the initial estimate of ability should not have a major effect on the final estimate of ability. The tailoring process will quickly result in the administration of reasonably targeted items. The stopping criterion could be time, number of items administered, change in ability estimate, content coverage, a precision indicator such as the standard error, or a combination of factors.
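The iterative process above can be sketched in Python with a simulated examinee. Everything here is an illustrative stand-in, not the tutorial's actual implementation: the item bank is made up, the ability estimate is updated by a simple grid search over the likelihood rather than the Newton-Raphson procedure discussed later, and the stopping rule combines a maximum test length with a standard-error target.

```python
import math
import random

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    # Fisher information of a 3PL item at theta
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

def log_likelihood(theta, items, responses):
    ll = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        ll += math.log(p) if u else math.log(1.0 - p)
    return ll

def run_cat(true_theta, bank, max_items=10, se_target=0.33, seed=1):
    rng = random.Random(seed)
    used, items, responses = set(), [], []
    theta_hat = 0.0                       # initial ability estimate
    for _ in range(max_items):
        # Step 1: evaluate every unused item at the current estimate
        best = max((i for i in range(len(bank)) if i not in used),
                   key=lambda i: item_info(theta_hat, *bank[i]))
        # Step 2: administer it (here, simulate the response)
        u = 1 if rng.random() < p_3pl(true_theta, *bank[best]) else 0
        used.add(best)
        items.append(bank[best])
        responses.append(u)
        # Step 3: re-estimate ability (grid search as a simple stand-in)
        grid = [g / 20.0 for g in range(-60, 61)]
        theta_hat = max(grid, key=lambda t: log_likelihood(t, items, responses))
        # Stop once the standard error is small enough
        info = sum(item_info(theta_hat, a, b, c) for a, b, c in items)
        if 1.0 / math.sqrt(info) < se_target:
            break
    return theta_hat, len(items)

# Hypothetical 30-item bank spread across the ability scale
bank = [(1.0 + 0.05 * k, -2.0 + 4.0 * k / 29.0, 0.2) for k in range(30)]
est, n = run_cat(true_theta=1.0, bank=bank)
```

With a richer bank and the constraints discussed later, the same three-step skeleton still applies; only the selection and estimation details change.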
Step 1 references selecting the "best" next item. Little information about an examinee's ability level is gained when the examinee responds to an item that is much too easy or much too hard. Rather, one wants to administer an item whose difficulty is closely targeted to the examinee's ability. Furthermore, one wants to give an item that does a good job of discriminating between examinees whose ability levels are close to the target level.
Using item response theory, we can quantify the amount of information provided by an item at a given ability level. Under the maximum information approach to CAT, the approach used in this tutorial, the "best" next item is the one that provides the most information at the current ability estimate (in practice, constraints are incorporated in the selection process). With IRT, item information can be quantified as the squared slope of P_{i}(θ) standardized by the item's variance at θ. In other words,

I_{i}(θ) = [P'_{i}(θ)]² / [P_{i}(θ)(1 - P_{i}(θ))]

where P_{i}(θ) is the probability of a correct response to item i at ability θ, P'_{i}(θ) is the first derivative of P_{i}(θ) with respect to θ, and I_{i}(θ) is the information function for item i.
Thus, for Step 1, I_{i}() for each item can be evaluated using the current ability estimate. While maximizing information is perhaps the best known approach to selecting items, Kingsbury and Zara (1989) describe several alternative item selection procedures.
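Concretely, Step 1 amounts to evaluating I_{i}(θ) for every available item at the current ability estimate and taking the largest. The three-item bank below is made up for illustration; the derivative formula in the comment is the 3PL-specific form:

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    """I(theta) = P'(theta)**2 / (P * (1 - P)); for the 3PL model
    the derivative is P' = a * (P - c) * (1 - P) / (1 - c)."""
    p = p_3pl(theta, a, b, c)
    p_prime = a * (p - c) * (1 - p) / (1 - c)
    return p_prime**2 / (p * (1 - p))

bank = [(0.8, -1.0, 0.20),   # easy, weakly discriminating
        (1.5,  0.1, 0.25),   # well targeted at theta near 0
        (2.0,  1.2, 0.20)]   # hard, highly discriminating
theta_hat = 0.0
best = max(range(len(bank)), key=lambda i: item_info(theta_hat, *bank[i]))
print(best)   # prints 1
```

The well-targeted middle item wins at theta_hat = 0: the easy item is nearly a giveaway, and the hard item is mostly guessing territory for this examinee.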
In Step 3, a new ability estimate is computed. The approach used in this tutorial is a modification of the Newton-Raphson iterative method for solving equations, outlined in Lord (1980, p. 181). The estimation starts with an initial estimate θ_{S}, computes the probability of a correct response for each administered item using θ_{S}, and then adjusts the ability estimate to obtain improved agreement between the probabilities and the observed response vector. The process is repeated until the adjustment is extremely small. Thus:

θ_{S+1} = θ_{S} + [ Σ (u_{i} - P_{i}(θ_{S})) P'_{i}(θ_{S}) / (P_{i}(θ_{S})(1 - P_{i}(θ_{S}))) ] / [ Σ I_{i}(θ_{S}) ]

where u_{i} is the examinee's scored response to item i (1 if correct, 0 if incorrect) and both sums run over the items administered so far.

The second term on the right hand side of the above equation is the adjustment; θ_{S+1} denotes the adjusted ability estimate. The denominator of the adjustment is the sum of the item information functions evaluated at θ_{S}. When θ_{S} is the maximum likelihood estimate of the examinee's ability, the sum of the item information functions is the test information function, I(θ).
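A minimal implementation of this iterative update, assuming the 3PL model and treating the item parameters as known, might look like the following; the example items and responses are made up:

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def p_prime(theta, a, b, c):
    # derivative of the 3PL response function with respect to theta
    p = p_3pl(theta, a, b, c)
    return a * (p - c) * (1 - p) / (1 - c)

def item_info(theta, a, b, c):
    p = p_3pl(theta, a, b, c)
    return p_prime(theta, a, b, c)**2 / (p * (1 - p))

def estimate_theta(items, responses, theta=0.0, tol=1e-4, max_iter=50):
    """Repeat the adjustment until it is extremely small (Lord, 1980).
    responses holds u_i = 1 for a correct answer, 0 for incorrect."""
    for _ in range(max_iter):
        num = sum((u - p_3pl(theta, *it)) * p_prime(theta, *it)
                  / (p_3pl(theta, *it) * (1 - p_3pl(theta, *it)))
                  for it, u in zip(items, responses))
        den = sum(item_info(theta, *it) for it in items)
        adjustment = num / den
        theta += adjustment
        if abs(adjustment) < tol:
            break
    return theta

# Three administered items (a, b, c); responses: right, right, wrong
items = [(1.2, -0.5, 0.20), (1.5, 0.0, 0.20), (1.0, 0.8, 0.25)]
theta_hat = estimate_theta(items, [1, 1, 0])
```

With two correct answers on the easier items and a miss on the hardest, the estimate settles a bit above zero, as one would expect.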
The standard error associated with the ability estimate is calculated by first determining the amount of information the set of items administered to the candidate provides at the candidate's ability level--this is easily obtained by summing the values of the item information functions at the candidate's ability level to obtain the test information, I(θ). Second, the test information is inserted in the formula below to obtain the standard error:

SE(θ) = 1 / √I(θ)
Thus, the standard error for individuals can be obtained as a by-product of computing an estimate of an examinee's ability.
In classical measurement, the standard error of measurement is a key concept and is used in describing the level of precision of true score estimates. With a test reliability of 0.90, the standard error of measurement for the test is about .33 of the standard deviation of examinee test scores. In item response theory-based measurement, and when ability scores are scaled to a mean of zero and a standard deviation of one (which is common), this level of reliability corresponds to a standard error of about .33 and test information of about 10. Thus, it is common in practice to design CATs so that the standard errors are about .33 or smaller (or correspondingly, test information exceeds 10--recall that if test information is 10, the corresponding standard error is about .33). [This paragraph was contributed by Ron Hambleton, University of Massachusetts].
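The relationship between test information and the standard error is easy to check numerically; this is just the formula SE = 1/√I applied to the rule of thumb above:

```python
import math

def standard_error(test_information):
    # SE of the ability estimate = 1 / sqrt(test information)
    return 1.0 / math.sqrt(test_information)

# Test information of 10 gives a standard error of about .32,
# consistent with the ".33 or smaller" design target
print(round(standard_error(10.0), 3))   # -> 0.316
# Conversely, an SE of exactly .33 corresponds to information 1/.33**2
print(round(1.0 / 0.33**2, 1))          # -> 9.2
```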
Potential of computer adaptive tests
In general, computerized testing greatly increases the flexibility of test management. This potential is often described (e.g. Urry, 1977; Grist, Rudner, and Wise, 1989; Kreitzberg, Stocking, and Swanson, 1978; Olsen, Maynes, Slawson and Ho, 1989; Weiss and Kingsbury, 1984; Green, 1983). Some of the benefits are:
Limitations
Despite the above advantages, computer adaptive tests have numerous limitations, and they raise several technical and procedural issues:
Key Technical and Procedural Issues
There is a fair amount of guidance in the literature with regard to technical, procedural, and equity issues when using CAT with large scale or high stakes tests (Mills and Stocking, 1995; Stocking 1994; Green, Bock, Humphreys, and Reckase, 1984; FairTest, 1998). In this section, I outline the following issues:
Balancing content - Most CATs seek to quickly provide a single point estimate for an individual's ability. How can CAT, therefore, accommodate content balance?
The item selection process used in this tutorial depends solely on item information for choosing the next item to administer. While this procedure may be optimal for determining an individual's overall ability level, it doesn't assure content balance and doesn't guarantee that one could obtain subtest scores. Often one wants to balance the content of a test. The test specifications for a mathematics computation test, for example, may call for certain percentages of items to be drawn from addition, subtraction, multiplication and division.
If one is just interested in obtaining subtest scores, then each subtest can be treated as an independent measure and the items within the measure adaptively administered. When subtests are highly correlated, the subtest ability estimates can be used effectively in starting the adaptive process for subsequent subtests.
Kingsbury and Zara (1989, 1991) outline a constrained computer adaptive testing (C-CAT) procedure that provides content balancing: items are partitioned into mutually exclusive content groups, and the next item is selected from the group whose share of the items administered so far falls furthest below its target percentage.
A major disadvantage to this approach is that the item groups must be mutually exclusive. When the number of item features of interest becomes large, the number of items per partition will become small. Further, it may not always be desirable to group items into mutually exclusive subsets.
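A sketch of the C-CAT idea, assuming mutually exclusive content groups with hypothetical target percentages: pick the group whose share of items administered so far falls furthest below its target, then take the most informative item within it.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

# Hypothetical bank: content area -> list of (a, b, c) item parameters
bank = {
    "addition":       [(1.2, -0.8, 0.20), (1.0, -0.2, 0.20)],
    "subtraction":    [(1.1, -0.4, 0.25), (1.4,  0.3, 0.20)],
    "multiplication": [(1.3,  0.2, 0.20), (0.9,  0.9, 0.25)],
    "division":       [(1.0,  0.6, 0.25), (1.5,  1.1, 0.20)],
}
targets = {"addition": .25, "subtraction": .25,
           "multiplication": .25, "division": .25}

def next_content_area(counts, targets):
    """Pick the content area furthest below its target share."""
    total = sum(counts.values())
    if total == 0:
        return max(targets, key=targets.get)
    return max(targets, key=lambda k: targets[k] - counts[k] / total)

# After two addition items, another area is due next
counts = {"addition": 2, "subtraction": 0,
          "multiplication": 0, "division": 0}
area = next_content_area(counts, targets)
theta_hat = 0.0
best = max(bank[area], key=lambda it: item_info(theta_hat, *it))
```

Within the chosen area, selection is still by maximum information, so the statistical machinery is unchanged; only the candidate pool is constrained.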
Wainer and Kiely (1987) proposed the use of testlets as the basis for tailored branching. Items are grouped into small testlets developed following desired test specifications. The examinee responds to all the items within a testlet. The results on that testlet and previously administered testlets are then used to determine the next testlet. Wainer, Kaplan and Lewis (1992) have shown that when the testlets are small, the gains from making the testlets themselves adaptive are modest.
Swanson and Stocking (1993) and Stocking and Swanson (1993) describe a weighted deviations model (WDM) which selects items using linear programming based on numerous simultaneous constraints involving statistical and content considerations. One constraint would be to maximize item information. Other constraints might be mathematical representations of the test specifications or a model to control for item overlap. The traditional linear programming model is not always applicable as some "constraints cannot be satisfied simultaneously with some other (often non-mutually exclusive) constraint. WDM resolves the problem by treating the constraints as desired properties and moving them to the objective function" (Stocking and Swanson, 1993, p. 280).
While WDM can often be solved using linear programming techniques, Swanson and Stocking (1993) provide a heuristic for solving the WDM.
The preferred choice would depend on the number and nature of the desired constraints.
Administering items belonging to sets - Can CAT accommodate items belonging to sets?
In the typical reading assessment, the examinee reads a passage and responds to several questions concerning that passage. The stimulus material is presented just once. One would not want a computer adaptive test that treats each item independently; an examinee might be presented with the same stimulus multiple times and have to work through a passage just to answer one question at a time.
Each passage could be treated like a testlet, as described above. An alternative approach, described by Mills and Stocking (1996), would be to present only the most targeted items within a passage. Depending on the test specifications, an examinee might receive 3 of the 10 questions associated with a given passage.
Examinee Considerations - What are some examinee issues with regard to CAT?
Wise (1997) raised several issues from the examinee's perspective, including item review and equity. He noted that research consistently reports that examinees want a chance to review their answers. He also noted that when examinees change their answers, they are more likely to legitimately improve their scores. Most CATs cannot accommodate an option for examinees to review their answers (Wainer's (1987) testlet approach is a notable exception). If review and answer changing were possible, a clever examinee could intentionally miss initial questions. The CAT program would then assume low ability and select a series of easy questions. The clever examinee would then go back and change the answers, getting them all right. The result could be a high percentage of correct answers which would result in an artificially high estimate of the examinee's ability.
While poor and minority children have less access to computers, Wise noted that the research on equity and CAT is mixed. He noted that there are racial and ethnic differences on the use of and desired amount of testing time. Yet some research has found that Blacks fare better on computer tests than on conventional tests. Wise concluded that, since the research is inconclusive, the issue should be investigated with regard to each test being developed.
Item exposure - How can CAT be modified to ensure that certain items are not over-used?
Without constraints, the item selection process will select the statistically best items. This will result in some items being more likely to be presented than others, especially in the beginning of the adaptive process. Yet one would be interested in assuring that some items are not over-used. Overriding the item selection process to limit exposure will better assure the availability of item level information and enhance test security. However, it also degrades the quality of the adaptive test. Thus, a longer test would be needed.
One approach to controlling exposure is to randomly select the item to be administered from a small group of best fitting items. For example, McBride and Martin (1983) suggest randomly selecting the first item from the five best fitting items, the second item from the four best fitting items, the third from a group of three, and the fourth from a group of two. The fifth and subsequent items would be selected optimally. After the initial items, the examinees would be sufficiently differentiated and would optimally receive different items. Kingsbury and Zara (1989, p. 369) report adding an option to Zara's CAT software to randomly select from the two to ten best items.
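The McBride and Martin scheme can be sketched in a few lines; ranked is assumed to be a list of item ids sorted from most to least informative at the current ability estimate:

```python
import random

def pick_item(ranked, n_already_administered):
    """Randomly select the 1st item from the 5 best, the 2nd from the
    4 best, the 3rd from 3, the 4th from 2; thereafter take the best."""
    group_size = max(1, 5 - n_already_administered)
    return random.choice(ranked[:group_size])

random.seed(0)
ranked = ["i07", "i03", "i21", "i12", "i05", "i18"]
print(pick_item(ranked, 0))   # one of the five most informative
print(pick_item(ranked, 4))   # always the single best item, i07
```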
Sympson and Hetter (1985) developed an approach that controls item exposure using a probability model. The approach seeks to assure that the probability that an item is administered, P(A), is less than some value r, the expected (but not observed) maximum rate of item usage. If P(S) denotes the probability that an item is selected as optimal, and P(A|S) denotes the probability that the item is administered given that it was selected as optimal, then P(A) = P(A|S) * P(S). The values of P(A|S), the exposure control parameters for each item, can be determined through simulation studies.
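A sketch of the Sympson-Hetter administration step, assuming the exposure control parameters P(A|S) have already been obtained from simulation studies; the injectable rng argument is just for deterministic illustration:

```python
import random

def select_item(ranked, exposure, rng=None):
    """Walk down the information-ordered item list; an item selected
    as optimal is actually administered only with probability P(A|S).
    exposure maps item id -> P(A|S); missing items default to 1.0."""
    draw = rng if rng is not None else random.random
    for item in ranked:
        if draw() < exposure.get(item, 1.0):
            return item
    return ranked[-1]            # fall back to the last candidate

ranked = ["i01", "i02", "i03"]
exposure = {"i01": 0.2, "i02": 0.9}
# With a draw of 0.5, i01 is passed over (0.5 >= 0.2) and i02 is given
print(select_item(ranked, exposure, rng=lambda: 0.5))   # -> i02
```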
Item pool characteristics - Can any test be used for CAT?
Lord (1980, p. 152) pointed out that a 3PL item provides the most information at

θ_{max} = b_{i} + (1/a_{i}) ln[(1 + √(1 + 8c_{i})) / 2]

and that the most information that can be provided is

I_{i}(θ_{max}) = [a_{i}² / (8(1 - c_{i})²)] [1 - 20c_{i} - 8c_{i}² + (1 + 8c_{i})^{3/2}]

Maximum information is a function of both the a_{i} and c_{i} parameters. Because information increases with the square of a_{i}, an item whose a_{i} = 1.00 will be 16 times more effective than an item whose a_{i} = .25. An item whose c_{i} = 0.00 (i.e. a free response item) will be about 1.6 times as effective as an item with c_{i} = .25.
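These ratios can be checked numerically with Lord's maximum-information expression (the closed form for the 3PL, written here without the 1.7 scaling constant):

```python
def max_info(a, c):
    """Most information a 3PL item can provide (Lord, 1980, p. 152):
    (a^2 / (8(1-c)^2)) * (1 - 20c - 8c^2 + (1 + 8c)^1.5)."""
    return (a**2 / (8 * (1 - c)**2)) * (1 - 20*c - 8*c**2 + (1 + 8*c)**1.5)

# Information scales with a squared: a = 1.00 vs a = .25 is a 16x gain
print(round(max_info(1.00, 0.25) / max_info(0.25, 0.25), 1))  # -> 16.0
# A free-response item (c = 0) vs c = .25, same a: about 1.6x
print(round(max_info(1.0, 0.0) / max_info(1.0, 0.25), 2))     # -> 1.62
```

Note that for c = 0 the expression collapses to a²/4, the familiar maximum information of a two-parameter logistic item.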
Thus, the ideal item pool for a computer adaptive test would be one with a large number of highly discriminating items at each ability level. The information functions for these items would appear as a series of peaked distributions across all levels of theta.
The item pool used in this tutorial is not ideal for computer adaptive testing. There are large numbers of low discriminating items and most item difficulties are between -1.0 and 1.0.
Another way to look at an item bank is to look at the sum of the item information functions. This Test Information Function shows the maximum amount of information the item bank can provide at each level of θ under either traditional or CAT administration. The test information function for the item pool used in this tutorial is shown in the next figure:
The information that can be provided by this item bank peaks at θ = 1.4. The item bank is strongest when 0 < θ < 2. Again, these curves define upper bounds. In practice, the amount of information will be lower at all levels of θ because an examinee only takes a sample of the items in the item bank. In terms of CAT, fewer items will need to be administered to examinees in the 0 < θ < 2 range in order to achieve a given level of precision.
Item pool size - How big does an item pool need to be?
The size of the item pool needed depends on the intended purpose and characteristics of the tests being constructed. Weiss (1985) points out that satisfactory implementations of CAT have been obtained with an item pool of 100 high quality, well distributed items. He also notes that properly constructed item pools with 150-200 items are preferred. If one is going to incorporate a realistic set of constraints (e.g. random selection from among the most informative items to minimize item exposure; selection from within subskills to provide content balance) or administer a very high stakes examination, then a much larger item bank would be needed.
Shifting parameter estimates - Can one expect the item response theory item parameters to be stable under computer adaptive item administration?
Numerous studies using live examinees have documented the equivalence of paper-and-pencil and computer adaptive tests by demonstrating equal ability estimates, equal variances, and high correlations (see Bergstrom, 1992 for a synthesis of 20 such studies). This equivalence implies that the underlying assumptions for CAT were met and that CAT was robust.
Two key assumptions for IRT-based computer adaptive testing are unidimensional item pools and fixed item parameters. The dimensionality of the item pool should not be a major concern since it is routinely investigated as part of quality test development. Of concern, however, is whether the IRT parameters change due to mode of administration or change over time. When examining person-fit by test booklet, Rudner, Bracey and Skaggs (1996) noted that the fit for calculator items was much worse than the fit for the same items when administered without a calculator. Thus, there is a very real possibility that the item parameters under CAT may not be the same as under paper-and-pencil administration. Also of concern is the possibility that the IRT parameters will shift due to changes in curriculum or population characteristics. The issue of shifting parameters, however, could easily be addressed by recalculating IRT parameters after a CAT administration and comparing values.
Stopping rules - How does one determine when to stop administering items? What are the implications of different stopping rules?
One of the theoretical advantages of computer adaptive tests is that testing can be continued until a satisfactory level of precision is achieved. Thus, if the item pool is weak at some section of the ability continuum, additional items can be administered to reduce the standard error for the individual. Stocking (1987), however, showed that such variable-length testing can result in biased estimates of ability, especially if the test is short. Further, the nuances of a precision-based (and hence variable-test-length) stopping rule would be hard to explain to a lay audience.
Baker, F. (1985) The basics of item response theory. Portsmouth, NH: Heinemann Educational Books (out of print).
Bergstrom, B. (1992) Ability measure equivalents of computer adaptive and pencil and paper tests: A research synthesis. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. Part 5 in F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley (out of print).
Cleary, T.A. and R. Linn (1969). Error of measurement and the power of a statistical test. British Journal of Mathematical and Statistical Psychology, 22, 49-55.
Eignor, D., Stocking, M, Way, W., & Steffen, M. (1993). Case studies in computer adaptive test design through simulation. Princeton, NJ: Educational Testing Service Research Report RR-93-66.
Fairtest (1998). Computerized Testing: More Questions Than Answers. FairTest Fact Sheet. http://www.fairtest.org/facts/computer.htm
Green, B.F., Bock, R.D., Humphreys, L., Linn, R.L., & Reckase, M.D. (1984). Technical Guidelines for Assessing Computerized Adaptive Tests. Journal of Educational Measurement, 21(4), 347-360.
Hambleton, R.K. & R.W. Jones (Fall, 1993). An NCME Instructional Module on Comparison of Classical Test Theory and Item Response Theory and Their Applications to Test Development. Educational Measurement: Issues and Practice, 12(3), 38-47.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage.
Kingsbury, G., Zara, A. (1989). Procedures for Selecting Items for Computerized Adaptive Tests. Applied Measurement in Education, 2(4), 359-75.
Kingsbury, G., Zara, A. (1991). A Comparison of Procedures for Content-Sensitive Item Selection in Computerized Adaptive Tests. Applied Measurement in Education, 4 (3) 241-61
Kreitzberg, C., Stocking, M.L. & Swanson, L. (1978). Computerized Adaptive Testing: Principles and Directions, Computers and Education. 2, 4, pp. 319-329.
Lord, F.M. and M. Novick (1968) Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley (out of print).
Lord, F.M. (1980). Application of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Mills, C. & Stocking, M. (1996) Practical issues in Large Scale Computerized Adaptive Testing. Applied Measurement in Education, 9(4), 287-304.
Grist, S., Rudner, L. & Wise, L. (1989). Computer Adaptive Tests. ERIC Digest Series.
Rudner, L. (1990). Computer Testing: Research Needs Based on Practice. Educational Measurement: Issues and Practice, 2, 19-21.
Rudner, L., Bracey, G. & Skaggs, G. (1996). Use of person fit statistics in one high quality large scale assessment. Applied Measurement in Education, January.
Stocking, M.L. & Swanson, L. (1993). A Method for Severely Constrained Item Selection in Adaptive Testing. Applied Psychological Measurement, 17(3), 277-292.
Swanson, L. & Stocking, M.L. (1993). A Model and Heuristic for Solving Very Large Item Selection Problems. Applied Psychological Measurement, 17(2)2, 151-166.
Sympson, J.B. & Hetter, R.D. (1985) Controlling item exposure rates in computerized adaptive testing. Proceedings of the 27^{th} annual meeting of the Military Testing Association. San Diego, CA: Navy Personnel Research and Development Center.
van der Linden, W.J. & Hambleton, R.K. (Eds.) (1997). Handbook of Modern Item Response Theory. London: Springer Verlag.
Wainer, H. (1983). On Item Response Theory and Computerized Adaptive Tests: The Coming Technological Revolution in Testing. Journal of College Admissions, 28(4), 9-16.
Wainer, H. (1993) Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practice, 12,1,15-21.
Wainer, H & Kiely, G. (1987). Item clusters and computerized adaptive testing: the case for testlets. Journal of Educational Measurement, 24, 189-205.
Wainer, H., Dorans, N., Flaugher, R., Green, B., Mislevy, R., Steinberg, L. & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wainer, H., Kaplan, B., & Lewis, C. (1990). A comparison of the performance of simulated hierarchical and linear testlets. Journal of Educational Measurement, 27, 1-14.
Weiss, D.J. (1985). Adaptive Testing by Computer. Journal of Consulting and Clinical Psychology, 53(6), 774-789.
Wise, S. (1997). Examinee issues in CAT. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
Special thanks go to Ronald K. Hambleton, University of Massachusetts; Dennis Roberts, Pennsylvania State University; and Pamela R. Getson, National Institutes for Health, for their helpful comments on earlier versions of this program. My appreciation also goes to Kristen Starret for redrawing the graphics that accompany each item and Scott Hertzberg for proofreading the text.
This tutorial was developed using Active Server Pages, the scripting language of Microsoft's Windows NT Internet Information Server.