








The teaching and testing of reading comprehension in English as a second language are very important activities.

The recognition of the sequence or order of events in a text is a task at the literal level of reading comprehension (see Barrett, cited in Bruner & Campbell, 1978, pp. 199-200). The recognition of the sequence of events can therefore be used for measuring the reading comprehension of a story.

Many of the quantitative studies of the measurement of reading comprehension have employed one or more techniques such as factor analysis, item analysis, and multiple regression in the framework of classical test theory (CTT). Measurement techniques based on CTT are not the most useful statistical tools for assessing the degree of reading comprehension. Andrich and Godfrey (cited in Ludlow & Hillocks, 1985), for example, discussed the sample-dependent limitations inherent in CTT-based analyses of reading comprehension data.

Item response theory (IRT) is a modern measurement theory (Hambleton, Swaminathan, & Rogers, 1991). IRT is a family of several powerful models. The main advantage of using IRT is the invariance of person and item parameter estimation. When the data fit a particular model, person-free item parameters and item-free person parameters can be estimated in the framework of IRT. Some investigators have applied different IRT models to the measurement of different aspects of reading comprehension (e.g., Embretson & Wetzel, 1987; Ludlow & Hillocks, 1985).

The present investigation aims to apply the Rasch Partial Credit Model (RPCM; see Masters & Evans, 1986) of IRT to the recognition of the sequence of events as a task for measuring the reading comprehension of a story.


1. To construct the Sentence Sequencing Test (SST) to measure the reading comprehension of secondary school students of English (L2).

2. To apply item response theory (IRT) to the partial credit scoring of the reading comprehension of secondary school students.

3. To validate the SST in the framework of IRT using the RPCM.


Sample. The sample of 154 students, comprising 61 students (46 girls and 15 boys) from a Sindhi-medium secondary school and 93 students (38 girls and 55 boys) from two Gujarati-medium secondary schools in Bhavnagar city, was selected randomly. The students in the sample were studying English as a second language at the standard 9 level. It was their second year of learning English, as their study of English begins, practically, from standard 8.

Tool. With a view to developing a sentence sequencing test (SST) for the present study, the story of 'A Greedy Dog' was rewritten in 20 sentences, with a conscious effort to assign a definite order to each of the 20 sentences. The sentence patterns, sentence length, and vocabulary used in the story were appropriate to the ability level of the students in the sample. In the SST, the 20 sentences were positioned without maintaining the definite order assigned to them; the sentences were placed randomly on the tool to minimize the effect of item placement. The final form of the SST was cyclostyled in the required number of copies.

Data collection. The SST was administered to the students in the sample in the normal classroom situation by the first investigator in the first week of March, 1996. The students read the disordered sentences in the form of the SST. Instructions for noting down responses were given clearly at the beginning. The students had to assign the numbers 1 through 20 to the sentences, in the space provided against each sentence, to arrange them in the proper order of the originally meaningful story. No strict time limit was imposed during the administration of the test; however, each session of test administration took about 50 minutes.

Data analysis. In the SST, there were 20 items, and the students had to assign a number to each item to order them. The difference between the correct order number and the number assigned by the student was calculated for each item. For example, if for item 1 the correct order number was 12 and a student assigned it the number 20, then the 'difference' was 8. The items were grouped according to the possible 'difference'. As the computer program CREDIT2, which implements the Rasch partial credit model of IRT, was to be used, partial credit scoring was applied. This program analyzes data only if no item has a maximum possible score of more than 5. So the 'differences' were clubbed into various categories, as shown in Table 1.

         TABLE 1: Grouping of Items and Clubbing of 'Differences'

Items            Possible Maximum 'Difference'   Interval of 'Difference'   Score Categories
7, 9, 15, 20                                                                5 (0 to 4)
3, 6, 12, 13                                                                6 (0 to 5)
14, 16, 17, 18                                                              4 (0 to 3)
2, 8, 11, 19                                                                5 (0 to 4)
1, 4, 5, 10                   11                            2               6 (0 to 5)


Table 1 shows that the maximum possible difference for items 1, 4, 5, and 10 was 11, scored with an interval of 2; so a difference of 0 to 1 was scored as '5'. There were four, five, or six ordered categories for assigning varying degrees of credit to item responses for reading comprehension. After scoring all the responses, a person x item matrix, containing (154 persons x 20 items) 3080 data points, was prepared.
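
The scoring rule described above can be sketched as follows. The helper names are hypothetical, and the actual preparation of the CREDIT2 input may have differed; the example uses the interval-2 clubbing reported for items 1, 4, 5, and 10.

```python
# Sketch of the partial-credit scoring described above (hypothetical
# helper names; the actual CREDIT2 input preparation may differ).

def difference(correct_position, assigned_position):
    """Absolute gap between an item's true order and the student's answer."""
    return abs(correct_position - assigned_position)

def credit(diff, interval, max_category):
    """Club a raw 'difference' into an ordered credit category.

    A difference of 0 to interval-1 earns the top category, the next
    interval earns the category below it, and so on down to 0.
    """
    return max(max_category - diff // interval, 0)

# Items 1, 4, 5, 10: maximum possible difference 11, interval 2,
# six score categories (0 to 5), as reported for Table 1.
print(credit(difference(12, 20), interval=2, max_category=5))  # difference 8 -> 1
print(credit(difference(12, 13), interval=2, max_category=5))  # difference 1 -> 5
```

With this rule, the worked example from the text (correct order 12, assigned 20, difference 8) falls four intervals below the top category.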

The major part of the data analysis in the present study was carried out by the computer program CREDIT2 on a personal computer. CREDIT2 is based on the program CREDIT (Masters, Wright, and Ludlow, cited in Masters and Evans, 1986).

CREDIT2 analyzed the data using the RPCM of IRT. The CREDIT2-based analyses included the calculation of item and person parameter estimates, standard errors of the item and person parameter estimates, item and person fit statistics, a score equivalence table, item probability plots (i.e., item characteristic curves, ICCs), and item information plots.

The standard errors, item information plots, and item information indices were used for analyzing reliability, and the fit statistics were used for testing the unidimensionality assumption for validity, following the guidelines provided by Gable et al. (1990), Masters and Evans (1986), Wright and Masters (1982, chap. 5), Hambleton and Swaminathan (1985, chap. 6), and Reise (1990).

The item separation index was calculated for verifying the definition of the variable (see Wright and Masters, 1982, pp. 91-93) using the computer program ISI (see Joshi, 1996). The t-test was applied for checking the invariance of the item and person parameter estimations.

To check the content validity and construct validity of the SST, the computer programs CA (for Cronbach's alpha) and POLYUD (for Cliff's consistency index 'c', based on graph theory) were respectively used. These two programs were developed by Rathod (1992, 1996).

The data were analyzed with the help of four computer programs.


Item parameter estimation. The estimation procedure implemented by CREDIT2 is a generalization of the UCON procedure described by Wright and Stone (cited in Masters and Evans, 1986) for the dichotomous Rasch model. The results of the item analysis for the SST are presented in Table 2.

It can be observed from Table 2 that CREDIT2 provided five estimates for items 1, 3, 4, 5, 6, 10, 12, and 13; four estimates for items 2, 7, 8, 9, 11, 15, 19, and 20; and three estimates for items 14, 16, 17, and 18, because four to six ordered performance levels (0 to 3, 0 to 4, and 0 to 5) were identified in the corresponding items.

In the SST, items can thus be thought of as 'three-step' (4 items), 'four-step' (8 items), or 'five-step' (8 items) items. The estimates for each item corresponded to the transitions between the response categories (4 to 6) defined for it. For example, the estimates -1.05, 0.63, -0.14, -0.17, and 1.39 logits were obtained for item 1. These estimates show that step 1 was endorsed more frequently than step 5. The estimates for the other items can be interpreted in the same way.

It can also be observed from Table 2 that the mean and SD of the item estimates were -0.00 and 0.83, and the mean and SD of their standard errors were 0.27 and 0.13. In the last column of Table 2, the item fit statistic is presented. If an item has a fit value of more than +3.00, it indicates poor item fit (Gable et al., 1990). In the present study, the maximum item fit value was +1.16, indicating that the items of the SST fit the model satisfactorily.

                     TABLE 2: Item Estimates, Standard Errors and Fit Indices for the SST

Item   Item Estimates                       Standard Errors         Fit
 1     -1.05  0.63 -0.14 -0.17  1.39        .52 .24 .20 .17 .21
 2     -1.29  0.08 -1.06  1.59              .73 .29 .21 .20
 3      0.49 -0.16 -0.11 -0.35 -0.91        .53 .40 .31 .24 .19
 4     -0.33 -0.91 -0.43  0.31  1.12        .27 .43 .24 .17 .20
 5     -0.30 -0.77 -0.11  0.63  0.43        .58 .36 .21 .17 .18
 6      1.21 -0.68 -0.02  0.65 -0.00        .30 .26 .20 .17 .18
 7     -0.96  0.63  0.50 -1.95              .59 .30 .24 .21
 8      0.16 -0.46  0.63  0.66              .31 .22 .17 .19
 9      1.48  0.41  0.46 -1.14              .21 .19 .18 .17
10     -0.23 -0.05 -0.29  0.19  0.92        .47 .29 .22 .17 .19
11      0.14 -1.71  0.28  0.48              .58 .40 .18 .17
12      0.98 -0.75  0.11  0.03  0.35        .36 .30 .21 .18 .17
13     -0.22  0.13  0.56 -0.02  0.80        .36 .23 .18 .17 .20
14      0.63  0.82  0.48                    .18 .17 .19
15      0.09  1.05  0.55 -2.64              .40 .29 .26 .23
16     -0.34 -0.86 -0.07                    .46 .27 .17
17      0.39  0.89 -0.07                    .20 .17 .18
18     -1.36  0.57 -0.38                    .46 .19 .17
19     -1.57  0.58 -0.32  0.60              .60 .22 .18 .17
20     -1.54  0.62  1.49 -1.49              .51 .21 .18 .17

Mean   -0.00                                .27
SD      0.83                                .13

CREDIT2 provided item probability plots (IPPs), or ICCs, for each item, but space does not allow presenting all the IPPs generated by the computer program for all 20 items. Each IPP for a given item contained one probability curve for each response category.

The discussions of the standard errors and fit statistics are presented in the relevant sections on reliability and validity.

Definition of the variable. In the present investigation, the reading comprehension of secondary school students was measured. The variable was defined in terms of the 20 items of the SST. But items (and steps within items) must be sufficiently well separated in difficulty to identify the direction and meaning of the variable (Wright and Masters, 1982, p. 91). Success in defining a line of increasing intensity depends on the extent to which the items are separated. Table 3 presents the item separation on the SST.

Table 3 shows that the item separation index for SST-based reading comprehension was 2.52, the number of item strata was 3.69, and the sample reliability of item separation was 0.86.

                   TABLE 3: Item Separation for Reading Comprehension on the SST

Item Separation Index    Number of Item Strata    Sample Reliability of Item Separation
        2.52                     3.69                            0.86

The items of the SST thus reasonably succeeded in defining the variable of reading comprehension.
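
The three quantities in Table 3 are linked by standard relations (see Wright and Masters, 1982). A minimal sketch, assuming the usual formulas, number of strata (4G + 1)/3 and separation reliability G^2/(1 + G^2), reproduces the reported figures from the separation index G alone:

```python
def item_strata(G):
    """Number of statistically distinct item strata: (4G + 1) / 3."""
    return (4 * G + 1) / 3

def separation_reliability(G):
    """Sample reliability of separation: G^2 / (1 + G^2)."""
    return G ** 2 / (1 + G ** 2)

G = 2.52  # item separation index reported in Table 3
print(round(item_strata(G), 2))             # 3.69
print(round(separation_reliability(G), 2))  # 0.86
```

Both values agree with Table 3, which supports reading the three columns as derived from a single separation statistic.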

Person parameter estimation. CREDIT2 provided individual results in terms of the raw score, the person parameter (ability estimate), and a statistic summarizing the fit to the model for each of the 154 students in the sample.

A summary of these individual results in terms of means and standard deviations (SD) is presented in Table 4.

              TABLE 4: The Summary of Individual Results on the SST

        Raw Score    Ability Estimates    Standard Errors    Fit Statistics
Mean                                           0.21              -0.01
SD                                                                1.06

Table 4 reveals that the sample differed markedly in raw scores, ability estimates (logits), and fit statistics. If the mean and SD of the person fit values are 0.00 and 1.00, respectively, good model-data fit is established. In the present study, the mean and SD of the person fit values were -0.01 and 1.06, respectively, indicating good model-data fit.

Reliability. To measure the reliability of the SST, the Cronbach alpha index was computed with the computer program CA (see Rathod, 1992). The Cronbach alpha index for the SST was 0.61, showing satisfactory reliability of the test according to CTT.
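
As a rough illustration of what the program CA computes, Cronbach's alpha can be obtained from a persons x items score matrix as follows; the tiny matrix shown is hypothetical, not the study's data:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items score matrix (list of rows)."""
    n_items = len(scores[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    # Variance of each item's scores across persons, and of the total scores.
    item_vars = [var([row[i] for row in scores]) for i in range(n_items)]
    total_var = var([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical matrix (3 persons x 4 items) just to show the call;
# the study's actual matrix was 154 persons x 20 items.
data = [[5, 4, 3, 5],
        [2, 1, 2, 3],
        [4, 3, 3, 4]]
print(round(cronbach_alpha(data), 2))  # 0.95
```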

Two measures of reliability were provided by CREDIT2: (1) the standard errors of the item and person parameter estimates; and (2) the item information plots (IIPs).

In the middle of Table 2, the standard errors of the item step difficulty estimates for the 20 items of the SST can be observed. The mean of the standard errors of the item estimates was 0.27, less than one-third of the SD of the item estimates; thus, the item estimates of the SST can be interpreted as reliable (see Masters and Evans, 1986). In the same way, it can be observed from Table 4 that the mean of the standard errors of the person parameter (ability) estimates was 0.21, less than half the SD of the ability estimates. This showed the reliability of the person parameter estimation.

The processes of item and person parameter estimation for the SST were thus reliably executed in the framework of the RPCM of IRT.

The item information function provides a viable alternative to the classical concepts of reliability and standard error. It is defined independently of any specific group of examinees and, moreover, represents the standard error of measurement at any chosen ability level. Thus, the precision of measurement can be determined at any level of ability that is of interest (Hambleton and Swaminathan, 1985, pp. 123-124). CREDIT2 generated an item information plot (IIP) for each item having a maximum of five scoring categories. The IIPs made the differing measurement precision of the items visible. Although space does not allow presenting all the IIPs here, the IIPs generated by CREDIT2 can be very useful for examining the precision of measurement with the items of the SST.
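
For readers without access to the CREDIT2 plots, the quantity an IIP displays can be approximated numerically. A minimal sketch, assuming the standard partial credit model formulation, computes the category probabilities and the item information (the variance of the item score) at a chosen ability, here using the step estimates reported for item 1 in Table 2:

```python
import math

def pcm_probs(theta, deltas):
    """Partial credit model category probabilities for one item.

    `deltas` are the item's step difficulty estimates in logits;
    category 0 corresponds to an empty numerator sum by convention.
    """
    cum = [0.0]
    for d in deltas:
        cum.append(cum[-1] + (theta - d))   # cumulative sum of (theta - delta_j)
    exps = [math.exp(c) for c in cum]
    total = sum(exps)
    return [e / total for e in exps]

def item_information(theta, deltas):
    """Item information at ability theta: the variance of the item score,
    sum_k k^2 P_k(theta) - (sum_k k P_k(theta))^2."""
    probs = pcm_probs(theta, deltas)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum(k * k * p for k, p in enumerate(probs)) - mean ** 2

# Step estimates reported for item 1 of the SST (Table 2).
deltas_item1 = [-1.05, 0.63, -0.14, -0.17, 1.39]
print(round(item_information(0.0, deltas_item1), 3))
```

Evaluating the function over a range of theta values traces out the same curve an IIP would show for that item.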

Validity. In the framework of the IRT, the concept of fit statistics for items and persons is very important for validity (Wright and Masters, 1982).

Model-data fit issues are a major concern when applying IRT models to real test data. Poor item fit indicates that the trait or ability level (logit) estimate has questionable validity (Reise, 1990). Gable et al. (1990) considered that an item fit statistic of more than +3.00 indicates poor item fit; Masters and Evans (1986) suggested that a value of more than +2.00 does.
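
The two cut-offs can be applied mechanically. A small sketch, assuming the fit statistics are the standardized values reported by CREDIT2:

```python
def flag_misfit(fit_values, cutoff=2.0):
    """Return the 1-based item numbers whose fit value exceeds the cutoff.

    cutoff=2.0 follows Masters and Evans (1986); use 3.0 for the more
    lenient Gable et al. (1990) criterion.
    """
    return [i for i, t in enumerate(fit_values, start=1) if t > cutoff]

# Hypothetical fit values; in the present study the maximum observed
# value was +1.16, so no item would be flagged under either criterion.
fits = [0.4, -1.2, 1.16, 0.9]
print(flag_misfit(fits))        # []
print(flag_misfit(fits, 3.0))   # []
```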

In the present study, model-data fit was analyzed using CREDIT2. In the right-hand column of Table 2, the item fit values for the SST can be observed. The maximum item fit value was +1.16. Thus, there was satisfactory fit between the items and the RPCM in the present study.

In this study, the goodness of model-data fit in terms of satisfactory item fit values demonstrated that: (1) the items were unidimensional for measuring reading comprehension; (2) the validity of the SST was established; and (3) all items were about equally effective at discriminating among examinees (Gable et al., 1990; Reise, 1990; Wright and Masters, 1982, chap. 5). Thus, two important assumptions of the RPCM (i.e., unidimensionality and equal discrimination power) were satisfied, and the validity of the SST was established.

The unidimensionality of the SST was also checked by the computer program POLYUD. This program calculates Cliff's consistency index 'c' using graph theory. For the SST, Cliff's 'c' was 0.53. This value demonstrated the unidimensionality as well as the construct validity of the SST.

Invariance of parameter estimation. As the RPCM was an adequate description of the data in the present research, it was of interest to test the invariance of the parameter estimations.

For examining the invariance of item parameter estimation, item estimates were obtained and compared using the responses of high-ability or high-scoring (above average) and low-ability or low-scoring (below average) students in the total sample. The RPCM item parameter estimates based on the high- and low-scoring groups (HSG and LSG) were obtained from the item analyses provided by CREDIT2. The t-test was applied to examine the significance of the difference between the mean item estimates. Table 5 summarizes the results of this testing.

   TABLE 5: Invariance of Item Estimates for the SST

Group   Number of Estimates   Mean Item Estimates   SD of Item Estimates
HSG
LSG

t-ratio = -0.00

Table 5 reveals that the obtained t-ratio was -0.00; hence, it was not significant at the 0.05 level. The observed difference between the mean item estimates obtained from the high- and low-scoring groups was not significant. The item estimates were sample-free.

To test the invariance of the person parameters, the items were arranged in difficulty order, and person ability estimates were then obtained using the difficult (upper 10) and easy (lower 10) items. The t-test was applied to examine the significance of the difference between the mean person estimates. The result of this testing is summarized in Table 6.

TABLE 6: Invariance of Person Estimates for the SST

Item Group             Number of Students   Mean Person Estimates   SD of Person Estimates
Difficult (upper 10)
Easy (lower 10)

t-ratio = 0.37

As Table 6 shows, the t-ratio was 0.37; so it was not significant at the 0.05 level. The observed difference between the mean person estimates obtained using the difficult and easy items was not significant. The person estimates were invariant; the person parameters were item-free.
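
The t-test used in both invariance checks can be sketched as follows. The two vectors below are hypothetical stand-ins for the HSG and LSG item estimates (the study's actual estimates came from separate CREDIT2 runs), and Welch's unequal-variance form is assumed; the study may have used the pooled-variance form instead.

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t-ratio for the difference between two group means."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):  # sample variance (n - 1 denominator)
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    na, nb = len(sample_a), len(sample_b)
    se = math.sqrt(var(sample_a) / na + var(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical item-estimate vectors for the high- and low-scoring groups.
hsg = [-1.02, 0.60, -0.10, -0.20, 1.35]
lsg = [-1.08, 0.66, -0.18, -0.14, 1.43]
print(round(welch_t(hsg, lsg), 2))
```

A t-ratio near zero, as in Tables 5 and 6, indicates that the two sets of estimates agree within sampling error.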


The application of IRT to measuring the reading comprehension of an English (L2) story in the present investigation was relatively successful. Generally, the SST data fit the RPCM. The invariance of the item and person parameter estimates was established. The SST could satisfactorily define the reading comprehension variable. The SST was a reliable and valid tool for assessing reading ability under the frameworks of IRT as well as CTT. The successful application of IRT to measuring reading comprehension is in line with the results obtained in previous studies.


Bruner, J. F., & Campbell, J. J. (1978). Participating in secondary reading: A practical approach. Englewood Cliffs, NJ: Prentice-Hall.

Embretson, S. E., & Wetzel, D. C.(1987). Component latent trait models for paragraph comprehension tests. Applied Psychological Measurement, 11, 175-193.

Gable, R. K., Ludlow, L. H., & Wolf, M. B. (1990). The use of classical and Rasch latent trait models to enhance the validity of affective measures. Educational and Psychological Measurement, 50, 869-878.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

 Joshi, B. (1996). ISI : A computer program for calculating item separation index. Unpublished manuscript, College of Education, Bhavnagar.

 Ludlow, L. H., & Hillocks, Jr., G. (1985). Psychometric considerations in the analysis of reading skill hierarchies. Journal of Experimental Education, 54, 15-21.

Masters, G. N., & Evans, J. (1986). Banking non-dichotomously scored items. Applied Psychological Measurement, 10, 355-367.

Rathod, N. S. (1992). An application of item response theory to criterion-referenced testing. Unpublished Ph.D. Thesis, Bhavnagar University, Bhavnagar. (In Gujarati).

Rathod, N. S. (1996). POLYUD : A computer program to assess the unidimensionality of a test. Unpublished manuscript, Bhavnagar University, Bhavnagar.

Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14, 127-137.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis : Rasch measurement. Chicago : Mesa Press.

1 Lecturer, Department of Education, Bhavnagar University, Bhavnagar 364 002 (Gujarat) INDIA.

 2 Professor, Department of Education, Bhavnagar University, Bhavnagar 364 002 (Gujarat) INDIA.
