ML20217H270

Discriminatory Effectiveness of Validated, Modified & New Items in Examinee Test Performance
Person / Time
Issue date: 04/30/1997
From: Usova G
NRC
To:
References
NUDOCS 9804290336


Text

April 1997

The Discriminatory Effectiveness of Validated, Modified, and New Items in Examinee Test Performance

George M. Usova, Ph.D.

Training and Assessment Specialist
Nuclear Regulatory Commission
Washington, D.C. 20555

Introduction

Data collected and analyzed by the Operator Licensing Branch at the U.S. Nuclear Regulatory Commission (NRC) have shed additional light on the discriminatory effectiveness of validated, modified, and new test items in examinee test performance. Examination data gathered over a period of five years show remarkable and consistent differences in test item performance among validated (bank) items, modified items, and new items used on the NRC Generic Fundamentals Examination (GFE). Since there is a paucity of empirical research in this area, these recent and substantive findings may be valuable to organizations and agencies who use test item banks in developing examinations and may serve as useful information in setting examination development policy.

Although the NRC had not originally set out to measure the effectiveness of item discrimination among three categories of test items, the data on examination results, as collected and presented over time, began to yield a clear and distinctive pattern on how each category of item functioned. This serendipitous finding served as the basis to continue data collection and analysis. For five continuous years and over 28 separate examinations, the pattern of differences in item discrimination among item categories has remained consistently distinctive, as will be shown and discussed in the Tables and discussion that follow.

Examination Description

Since 1989, the Operator Licensing Branch of the Office of Nuclear Reactor Regulation at the NRC has administered the Generic Fundamentals Examination (GFE). This examination is administered twice yearly to candidates seeking a Reactor Operator (RO) or Senior Reactor Operator (SRO) license in nuclear power plants nationwide. In 1991, the NRC initiated a successful approach to GFE examination development through the combined use of validated, modified, and new items to yield a discriminating, content-valid examination.


The GFE consists of two separately administered 100-item examinations, specific to Boiling Water Reactors (BWR) and Pressurized Water Reactors (PWR). The license examinations measure candidate knowledge in three areas: (1) reactor theory, (2) plant components, and (3) thermodynamics. This examination must be passed with a minimum score of 80 percent before candidates are eligible to take the plant-specific written examination and operating test at their facilities.

The GFE measures fundamental knowledge applicable to all reactor operators and senior reactor operators; for example, test items include questions that measure candidate knowledge of plant components such as valves, sensors and detectors, controllers and positioners, pumps, motors, and generators, while reactor theory questions cover topics including reactivity coefficients, poisons, control rods, neutrons, and fission products. In essence, the GFE assesses candidate basic knowledge of the elements of nuclear power and plant behavior, and in this regard represents the underpinnings for later control room related problem-solving, trouble-shooting, and decision-making.

The GFE multiple choice item format has proven successful for objective scoring and standardizing a nationally-administered exam of this type.

The multiple choice format has also proven to be effective for testing higher cognitive thought processes.

Carefully designed multiple choice test items require candidates to think through responses and alternatives; they must weigh and consider the conditions posed in the stem of the question and further discriminate and eliminate among plausible distractors before choosing the correct answer. The mental processes involved in arriving at the correct answer often challenge candidates to analyze and synthesize information in a problem-solving context.

An example follows of a reactor theory test item taken from a past PWR exam:

Following a reactor trip, the power decrease rate initially stabilizes at negative one-third decade per minute when:

A. the long-lived delayed neutron precursors have decayed away.
B. decay gamma heating starts adding negative reactivity.
C. the short-lived delayed neutron precursors have decayed away.
D. the installed neutron source contribution to the total neutron flux becomes significant.

Answer: C.


Exam Development

Each examination is designed in accordance with approved NRC sample plan and test specification requirements. The sample plan assures that the examination represents and includes a balance of test items that cover a broad spectrum of knowledge and abilities required of the operator to perform the job. This job-based testing approach assures that the examination is relevant and valid to actual operator tasks, allowing confidence that operators possess fundamental knowledge related to their job.

Specifically, in order for the GFE to meet sample plan requirements, each BWR and PWR exam must consist of 44 "component" test items, 28 "reactor theory" test items, and 28 "thermodynamics" test items; each of these three broad categories is further broken down into finer sub-areas for test item development.
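The sample plan quotas above can be expressed as a simple consistency check. The sketch below is hypothetical (not an NRC tool); the function and variable names are illustrative only.

```python
# Hypothetical sketch: verifying that a draft 100-item GFE matches the
# 44/28/28 sample plan quotas described in the text.

SAMPLE_PLAN = {"component": 44, "reactor theory": 28, "thermodynamics": 28}

def check_sample_plan(items):
    """items: list of category strings, one per test item."""
    if len(items) != 100:
        raise ValueError(f"expected 100 items, got {len(items)}")
    counts = {cat: items.count(cat) for cat in SAMPLE_PLAN}
    for cat, quota in SAMPLE_PLAN.items():
        if counts[cat] != quota:
            raise ValueError(f"{cat}: expected {quota}, got {counts[cat]}")
    return counts

# A conforming draft: 44 + 28 + 28 = 100 items.
draft = ["component"] * 44 + ["reactor theory"] * 28 + ["thermodynamics"] * 28
print(check_sample_plan(draft))
```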

Each examination undergoes a rigorous process of test item development, reviews, pilot testing, and pre-validation phases where items are clarified and improved. Finally, before the examination is administered, each item undergoes an NRC/subject-matter expert (SME) technical and psychometric integrity review.

The entirety of the review processes helps to refine and improve the quality of each test item as a valid, reliable, and fair measurement of job-related knowledge. After the examination is administered, the utilities involved have the opportunity to comment upon any test item during a post-examination comment period.

Test Bank Use in Item Development

The NRC has developed a successful approach to examination development through the combined use of validated, modified, and new items to yield a discriminating, content-valid examination.

Since 1992, the NRC has adopted a 50-40-10 distribution of bank, modified, and new items, respectively. This approach was adopted to maintain sufficient discrimination in the examination.
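The 50-40-10 composition policy can be sketched as a simple assembly routine. This is an illustrative sketch only: the item pools, identifiers, and drawing logic below are hypothetical, not NRC code.

```python
# Illustrative sketch of the 50-40-10 item mix described in the text.
# Pools and item names are hypothetical placeholders.
import random

ITEM_MIX = {"bank": 50, "modified": 40, "new": 10}  # items per 100-item exam

def assemble_exam(pools, rng=random):
    """pools: dict mapping category -> list of candidate items."""
    exam = []
    for category, count in ITEM_MIX.items():
        # Draw without replacement from each category's pool.
        exam.extend(rng.sample(pools[category], count))
    return exam

pools = {
    "bank": [f"bank-{i}" for i in range(2291)],   # validated catalog size to date
    "modified": [f"mod-{i}" for i in range(200)], # hypothetical pool sizes
    "new": [f"new-{i}" for i in range(50)],
}
exam = assemble_exam(pools)
print(len(exam))  # 100
```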

In essence, one-half of the test items appearing on any particular exam are drawn from a test item bank of previously validated test items; to date there are 2291 items in the validated bank (1155 BWR and 1136 PWR). Those validated test bank items are entered into the respective BWR and PWR Generic Fundamentals Test Item Catalogs maintained by the Institute for Nuclear Power Operations (INPO), which disseminates the Catalogs to its member utilities. The remaining one-half of the examination consists of 40 modified items and 10 new items. A modified item is drawn from a previously administered item in the bank and is defined as one that has one or more conditions changed in its stem and one or more distractor changes. As more and more items are developed that assess similar knowledge and abilities, there is increasing similarity--rather than differences--to previously seen items; the similarity effect is overcome through professional diligence and attention to detail that sufficiently alters item stems and distractor(s) to evoke and engage slightly different applications of a higher cognitive mental process and one that fosters item discrimination.

Newly developed items, i.e., items not seen before in a catalog or on any previous exam, are also developed to ensure that the examination discriminates sufficiently to identify those candidates who are either unprepared or unknowledgeable in generic fundamentals. The discriminatory intent of new items is similar to that of the modified items. The difference between a modified item and a new item is largely one of degree. The modified item, though not seen before, bears a resemblance to a previously validated item, whereas the new item has no such direct resemblance to a validated item. In short, each examination administered has a mix of discriminatory items to support the examination's purpose and integrity.

Development of New and Modified Items

New and modified items are developed and assessed using Bloom's Taxonomy as a reference benchmark; the NRC uses this taxonomy to classify the levels of knowledge of test items.

Bloom's Taxonomy is a classification scheme that permits the classification of items by the level (depth) of mental thought and performance required to answer the test question. The three levels, as modified, in ascending order are as follows:

Level 1. Fundamental knowledge (simple memory)
Level 2. Comprehension
Level 3. Analysis/Synthesis/Application

According to the taxonomy, the three levels of knowledge can be defined:

Fundamental Knowledge testing is defined as a simple mental process that tests the recall or recognition of information bits with concrete referents; examples include knowledge of terminology, definitions, or specific facts.

Comprehension testing involves the mental process of understanding the material through relating it to its own parts or to some other material; examples include rephrasing information in different words, describing or recognizing relationships, showing similarities and differences among parts or wholes, and recognizing how systems interact, including consequences or implications.


Analysis, synthesis, and application testing is a more active and product-oriented testing which involves the multi-part mental process of assembling, sorting, or integrating the parts (information bits and their relationships) so that the whole, and the sum of its parts, can be used to: predict an event or outcome, solve a problem, or create something new, i.e., mentally using the knowledge and its meaning to problem-solve or create.

In the hierarchy, from its fundamental base, each level of knowledge builds upon the lower level and cumulatively embraces that level, rising to the third and most comprehensive level--Analysis/Synthesis/Application. Level 3, the most valid and efficient level for testing, subsumes and implicitly tests the lower levels of knowledge. In other words, testing at the application or "use" level of knowledge, de facto, also tests the comprehension and fundamental knowledge levels upon which Level 3 was built; consequently, it is more efficient to test at the higher levels than at the lower levels.

Results

Data gathered since 1988, the inception of the GFE test program, includes a total of 43 examinations that have tested 4002 candidates (see Table 1). The difference in the uneven numbers between BWR and PWR occurs because only the BWR exam was used as a pilot examination at the inception of the program.

TABLE 1. CUMULATIVE DATA

A total of 4,002 individuals have taken the GFEs since the inception of the program in September 1988. The table below provides cumulative data for a number of categories. The data is further grouped into BWR, PWR, and total classifications.

FACILITY   NO. OF   NO. OF      NO. OF     MEAN       ITEM
TYPE       EXAMS    EXAMINEES   FAILURES   SCORE (%)  BANK SIZE
BWR        22       1456        98         88.12      1155
PWR        21       2546        110        89.96      1136
TOTAL      43       4002        208        89.29      2291
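The TOTAL row of Table 1 can be checked against the BWR and PWR rows directly. The sketch below simply transcribes the row data and verifies the examinee-weighted mean and the failure rate; the variable names are illustrative.

```python
# Consistency check on Table 1: the TOTAL mean should equal the
# examinee-weighted average of the BWR and PWR means, and the failure
# rate should match the "slightly above five percent" cited later.

rows = {
    "BWR": {"examinees": 1456, "failures": 98, "mean": 88.12},
    "PWR": {"examinees": 2546, "failures": 110, "mean": 89.96},
}

total_examinees = sum(r["examinees"] for r in rows.values())  # 4002
total_failures = sum(r["failures"] for r in rows.values())    # 208

# Examinee-weighted overall mean reproduces the TOTAL row (89.29).
overall_mean = sum(r["examinees"] * r["mean"] for r in rows.values()) / total_examinees

failure_rate = total_failures / total_examinees

print(total_examinees, total_failures)  # 4002 208
print(round(overall_mean, 2))           # 89.29
print(round(100 * failure_rate, 1))     # 5.2
```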

From October 1991 through October 1996 (the dates upon which data on item performance was collected and analyzed) there were a total of 28 examinations that tested 2064 candidates (refer to Table 2). Item bank size is reported as a total of 2291 in both Tables. It should be noted that the BWR and PWR banks increase by approximately 50 items per administration, which encompasses the modified and new items that are entered into the validated pool of items after each exam is administered. Items that are determined to be psychometrically or technically flawed in the post-examination analysis are not entered and/or are deleted from the bank.

TABLE 2. CUMULATIVE DATA

A total of 2,064 individuals have taken the GFEs since October 1991. The table below provides cumulative data for a number of categories. The data is further grouped into BWR, PWR, and total classifications.

FACILITY   NO. OF   NO. OF      NO. OF     MEAN       ITEM
TYPE       EXAMS    EXAMINEES   FAILURES   SCORE (%)  BANK SIZE
BWR        14       728         43         89.72      1155
PWR        14       1336        48         90.81      1136
TOTAL      28       2064        91         90.42      2291

As was stated earlier, the data and analysis show how items performed in three specific categorical areas:

1. validated bank items
2. modified bank items and
3. new items.

Tables 3 and 4 show the discriminatory effectiveness in item subscores among these three categories of items. Tables 3 and 4, respectively, show separate BWR and PWR vendor results. A distinct pattern of mean score performance can be observed for each item type and by vendor type. The overall mean scores for both examinations fall within a relatively stable range of 87-92 and can be viewed as a reference benchmark for evaluating the other item sub-categories.


TABLE 3. BWR AVERAGE SCORES BY ITEM TYPE
TABLE 4. PWR AVERAGE SCORES BY ITEM TYPE

[Tables 3 and 4 report, for each examination date from October 1991 through October 1996, the mean subscores for validated, modified, and new items along with the overall examination mean. The tabulated values are not recoverable from this transcription.]

Validated items (previously exposed) enjoy the highest mean scores, while new items represent the other extreme in a consistent pattern of lowest mean scores; unvalidated (modified) items lie between the extremes but somewhat below the overall examination mean. In only one instance (the BWR examination in October 1994) did the unvalidated (modified) items have a higher item difficulty level than the overall mean score. There appears to be no explanation for this other than to speculate that the modified items were not meaningfully altered.

Discussion

These results are intuitively unsurprising. One would certainly expect validated items, previously exposed and subject to test taker rehearsal and review, to be less discriminating and easier.

Similarly, at the other extreme, the 10 new items, as unseen before, are most discriminatory and seemingly "most difficult."

The reason for the lower performance of the new items is likely attributable to their newness alone, since there is no intended basis, during their development, to make them inherently any more difficult than the remaining items. Additionally, since there are only ten new items per examination administration, their reported mean scores are more sensitive to variation, having greater fluctuations and instability; however, their pattern of high discrimination relative to the other reported mean score sub-categories is constant.
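The greater instability of the 10-item subscore follows from sampling arithmetic. A minimal sketch, assuming each item is an independent Bernoulli trial with a per-item success rate p (an assumption introduced here for illustration, not a claim from the article):

```python
# Illustrative: why a 10-item subscore fluctuates more than a 50-item
# one. The standard error of a mean subscore shrinks as 1/sqrt(n).
import math

p = 0.90  # assumed per-item probability of a correct answer

def subscore_standard_error(n_items: int, prob: float = p) -> float:
    """Standard error (in percentage points) of a mean subscore over n items."""
    return 100 * math.sqrt(prob * (1 - prob) / n_items)

for n in (50, 40, 10):  # bank, modified, and new item counts per GFE
    print(n, round(subscore_standard_error(n), 2))
# 50 items -> ~4.24 points; 10 items -> ~9.49 points
```

The 10-item subscore thus carries roughly twice the standard error of the 50-item bank subscore, consistent with the greater fluctuation the article reports.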

The balanced use of test item banks with modified and new items combines to produce a discriminating examination and, moreover, arguably promotes learning and improves performance--an unintended but welcome outcome of the testing process.

Test item banks serve as a valuable resource for learning and represent a resource for training and test development; however, when all or too high a portion of the items for an examination are drawn from the validated bank and are identical to items that have previously been used for testing, the banks are used inappropriately. Put another way, previously administered items reduce examination integrity because examination discrimination is reduced. Discrimination is reduced because candidates are likely tested at a simple recognition level; arguably, comprehension and analysis levels of knowledge may not be assessable since mental thought has been reduced to a recognition level only, i.e., remembering only the question and its paired answer. Problem solving as a desirable mental process is similarly questionable because test items, having been rehearsed and anticipated, likely default to a level of rote-style answer recognition.


Furthermore, when the bank of items from which the exam is drawn is known to the candidates prior to the exam, the exam is regarded as highly predictable. Predictable exams tend not to discriminate because what is being tested is simple recognition of the answer. Although some examination predictability through the study of item banks is arguably worthwhile, because studying past examinations can have a positive learning value, total predictability of examination coverage through over-reliance upon examination banks reduces examination integrity; it reduces the examination to a non-discriminatory tool that, at best, only tests simple memory recognition of anticipated test items.

Compounding this predictability problem is an additional threat to validity: if test candidates know the precise and finite item pool from which test items will be drawn, they will tend to study only from that pool (i.e., studying to the test) and may exclude from study the larger domain of job knowledge. When this occurs, the validity inferences normally made from performance on the test to the larger realm of knowledge or skill to be mastered, from which the test sample was drawn, cannot confidently be made.

Related Research

As stated earlier, there is little empirical research available on the discriminatory effectiveness of item types in criterion-referenced tests. In research conducted by Hale et al. (Educational Testing Service, 1980), "Effects of Item Disclosure on TOEFL Performance," researchers found differences in performance among test takers who had access to disclosed items versus a control group who did not. The study, which covered only a four-week period, showed a positive score increase of 6.3 points when students received instruction from and studied from a test bank of 900 items, and only a 2.9 point score increase when receiving instruction from and studying from an item bank of 1800 items, over the control groups who had no such access to item banks for instruction and study.

The average disclosure effect--the difference between scores for the disclosed and undisclosed post-tests--was 4.6 percentage points (6.3 percentage points for the 900-item bank and 2.9 percentage points for the 1800-item bank). The data suggested that as the bank size became larger, the disclosure effect on performance became smaller.
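The averaging behind the 4.6-point figure is straightforward; the sketch below simply restates the numbers quoted from the Hale et al. study, with illustrative variable names.

```python
# Arithmetic behind the quoted TOEFL disclosure-effect figures
# (Hale et al., 1980): gains in percentage points by bank size.

disclosure_gain = {900: 6.3, 1800: 2.9}  # bank size -> score gain

# The reported "average disclosure effect" is the mean of the two gains.
average_effect = round(sum(disclosure_gain.values()) / len(disclosure_gain), 1)
print(average_effect)  # 4.6

# Larger bank, smaller effect: doubling the pool roughly halves the gain.
assert disclosure_gain[1800] < disclosure_gain[900]
```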

One variable that would affect this generalization is the amount of study time available. In the TOEFL experiment the disclosed forms were available for four weeks.

If disclosed items were to be made available for a much longer time, students would be able to study a large item pool more carefully. Whether they would actually do so is an open question, as there may be a limit to how early students would begin preparation for a test in earnest--perhaps only a few weeks before the test, as in the present situation. Nevertheless, as a reasonable assumption, the longer disclosed items are available, the greater is students' opportunity to study them--and thus, the greater is the expected effect of disclosure on test performance.

In sum, the evidence points out that test taker performance on the GFE and TOEFL is clearly affected by item disclosure. When items are made available to students for several weeks prior to administration of the test, students who study at least some of these items will increase their scores as a result. There are two categories of effect that can result from test preparation activities with disclosed items. The first is a general learning effect, i.e., an improvement in performance resulting from experience with the test in general. The second is a specific recall effect, i.e., an increase in performance due to recall of the specific questions and answers encountered in the test. It is this second effect with which the TOEFL study concerned itself and which the GFE study reinforces: performance is clearly affected positively by item disclosure. Moreover, it is believed that a disclosure effect on written test items would also be observed in an operational test if the test included disclosed items.

Implications

One question that may arise is what amount of examination integrity is lost through reduced item discrimination, i.e., higher scores and less discriminatory effectiveness with the group of disclosed items? The intent of any examination is to discriminate along some continuum of knowledge; moreover, an integral component of any examination's validity lies in its ability to discriminate, whether among people or upon knowledge.

In the case of the GFE, a criterion-referenced test, knowledge discrimination is the intent. Naturally, discrimination upon knowledge will also tend to discriminate among people, between those who pass and those who fail--a type of cognitive ability partitioning. In essence, those persons who fail the examination have failed to meet the 80 percent cut score addressing a minimal level of knowledge.

Since disclosed items are available to all who take the examination, regardless of ability, those benefitting most from disclosed items will tend to be those with less overall knowledge or ability, since it can be assumed that those with high overall ability would rely less upon item disclosure for their high scores. This test bias gives some advantage to the less capable test taker and may create a false positive.


Should test banks then be excluded from use? Or is studying test banks poor practice? These questions can be answered in one of two ways:

Yes, if the test is comprised exclusively of bank items. In this instance, all that is likely tested is rote recognition of previously studied or rehearsed disclosed items. In effect, the test largely functions as a memory test only.

No, if the test also includes a balance of modified, different, or new items. Developing a moderately discriminating criterion-referenced examination involves balancing item bank use with modified and new items. Test-takers who study the bank can predict with high confidence that a certain percentage of test items will reappear; however, the modified and new items, as a group, serve to reduce examination predictability and "force" test-takers to prepare and study content material above and beyond that in the test bank. It is the latter dynamic that permits the validity inference to be made confidently.

The inclusion of past disclosed test items can be beneficial since a general learning effect can result from test preparation activities with disclosed items. Concepts, principles, and factual details can be reviewed and reinforced through the study of test items. Items, along with their answers, can reinforce and clarify previously unclear concepts as well as introduce new learning through exposure to those questions.

In short, improved learning can arguably offset some loss in discrimination. Moreover, since all test candidates have equal exposure to past test items and equal opportunity to benefit from that exposure, the discrimination loss is borne equally by all candidates taking the examination. Therefore, any loss of test discrimination is uniformly felt by all candidates and can be considered a "wash" or equalizing effect.

Implications for Examination Development Policy

The data reveal that exposure to bank items improves overall score performance. Given the limited data available on several weeks of pre-exam review study time, it can be tentatively concluded that as bank size increases to levels beyond 900 items, the benefit of item exposure diminishes; this can be attributed to memory burden constraints.

Because item exposure may default performance to a rote recognition level, it is reasonable to conclude that the same exposure reduces the discriminant validity of those items, to some extent biases the "true score," and might convey an overestimate of ability in some cases. Although the secondary benefit of item exposure as an aid to learning is difficult to assess, it can be assumed that this exposure has a general and positive learning effect, especially when used by test takers who are familiar with the material, i.e., not blind recognition of the item and its answer but a meaningful analysis of why the answer to the item is correct and the incorrect distractors are incorrect.

Deliberate instruction in training programs can make this meaningful analysis occur.

Summary

The NRC GFE continues to be an effective discriminator in identifying candidates who have not mastered the fundamentals tested on the examination. The 50-40-10 mix and distribution of bank, modified, and new items, respectively, has served to generate a moderately discriminating examination. Because the GFE is a criterion-referenced examination, high scores are generally expected. Over its administration, the GFE has yielded an approximate average score of 88 percent, an expected and reasonable average given the criterion nature of the exam.

However, the exam has also yielded an average failure rate slightly above five percent; given the 4002 candidates who have taken the GFE to date, the exam has identified 208 candidate failures. The discriminatory ability of the exam to identify failures who are either ill-prepared or unknowledgeable is attributable to the present approach to item development that balances the use of bank, modified, and new items.

References

Hale, Gordon, et al. (December 1980). "Effects of Item Disclosure on TOEFL Performance," Research Report No. 8, Educational Testing Service, Princeton, NJ.

This article was prepared, in part, by an employee of the United States Nuclear Regulatory Commission on his own time apart from his regular duties. NRC has neither approved nor disapproved its technical content.
