An Item Response Theory approach to the maintenance of standards in public examinations in England

  1. Agresti, A. (2002). Categorical data analysis. Hoboken, New Jersey: John Wiley & Sons.
  2. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.
  3. Albert, J. & Ghosh, M. (2000). Item response modeling. In D. K. Dey, S. K. Ghosh & B. K. Mallick (Eds.), Generalized linear models: A Bayesian perspective (pp. 173-193). New York: Marcel Dekker.
  4. Alberts, R. V. J. (2001). Equating exams as a prerequisite for maintaining standards: Experience with Dutch centralised secondary examinations. Assessment in Education: Principles, Policy & Practice, 8(3), 353-367.
  5. Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? In E. V. Smith Jr. & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 143-166). Maple Grove, MN: JAM Press.
  6. Assessment and Qualifications Alliance. (2008). Mathematics 4301 Specification A 2008. Manchester: Assessment and Qualifications Alliance. Retrieved from http://store.aqa.org.uk/qual/pdf/AQA-4301-W-SP-08.PDF.
  7. Baird, J. (2000). Are examination standards all in the head? Experiments with examiners' judgements of standards in A-level examinations. Research in Education, 64, 91-100.
  8. Baird, J. (2007). Alternative conceptions of comparability. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 124-165). London: Qualifications and Curriculum Authority.
  9. Baird, J., Cresswell, M. J. & Newton, P. (2000). Would the real gold standard please step forward? Research Papers in Education, 15(2), 213-229.
  10. Baird, J. & Dhillon, D. (2005). Qualitative expert judgements on examination standards: Valid, but inexact. Manchester: Assessment and Qualifications Alliance.
  11. Baird, J., Fearnley, A., Fowles, D., Jones, B., Morfidi, E. & White, D. (2001). Tiering in the GCSE: A study undertaken by AQA on behalf of the Joint Council for General Qualifications. Joint Council for General Qualifications.
  12. Beaton, A. E. & Zwick, R. (1990). The effect of changes in the National Assessment: Disentangling the NAEP 1985-86 Reading Anomaly (Revised). National Assessment of Educational Progress, Educational Testing Service, Princeton, NJ.
  13. Béguin, A. A. (2000). Robustness of Equating High-Stakes Tests (Doctoral dissertation). University of Twente, Enschede, Netherlands. Retrieved from http://cito.nl/share/poc/dissertaties/dissertationbeguin2000.pdf.
  14. Béguin, A. A. & Glas, C. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541-561.
  15. Béguin, A. A., Wheadon, C., Meadows, M. & Eggen, T. (2007, November). Comparability of high-stakes assessments: the role of standard setting. Paper presented at the 8th annual conference of the Association for Educational Assessment (AEA) Europe, Stockholm.
  16. Binks, J. (2002). Official Response to the Science and Technology Parliamentary Committee Inquiry: Science Education from 14-19. London: Confederation of British Industry.
  17. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. Lord & M. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
  18. Black, B. & Bramley, T. (2008). Investigating a judgemental rank-ordering method for maintaining standards in UK examinations. Research Papers in Education, 23(3), 357-373.
  19. Black, P. (2007, May). Can we design a supportive assessment system? Paper presented at the Chartered Institute of Educational Assessors, London.
  20. Black, P., Harrison, C., Lee, C., Marshall, B. & Wiliam, D. (2003). Assessment for learning: Putting it into practice. Maidenhead, UK: Open University Press.
  21. Black, P. & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139-148.
  22. Bock, R. D. & Moustaki, I. (2007). Item response theory in a general framework. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 469-513). Amsterdam: Elsevier.
  23. Bolt, D. M., Cohen, A. S. & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, 26(4), 381-409.
  24. Bolt, D. M., Cohen, A. S. & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39(4), 331-348.
  25. Brennan, R. L. (2008). A discussion of population invariance. Applied Psychological Measurement, 32(1), 102-114.
  26. Brown, M. (1989). Graded assessment and learning hierarchies in mathematics: An alternative view. British Educational Research Journal, 15(2), 121-128.
  27. Cameron, J. (2001). Negative effects of reward on intrinsic motivation - A limited phenomenon: Comment on Deci, Koestner, and Ryan (2001). Review of Educational Research, 71(1), 29-42.
  28. Charmaz, K. (2006). Constructing grounded theory. London: Sage.
  29. Chen, W. & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
  30. Christie, T. & Forrest, G. M. (1980). Standards at GCE A-level: 1963 and 1973: A pilot investigation of examination standards in three subjects. Basingstoke: Macmillan Education.
  31. Cockcroft, W. (1982). The Cockcroft Report (1982): Mathematics counts. London: Her Majesty's Stationery Office.
  32. Coe, R. (2007). Common Examinee Methods. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 331-367). London: Qualifications and Curriculum Authority.
  33. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
  34. Cook, L. L. & Petersen, N. S. (1987). Problems related to the use of conventional and Item Response Theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11(3), 225-244.
  35. Cresswell, M. J. (1997). Examining judgements: Theory and practice of awarding public examination grades (Doctoral dissertation). London: Institute of Education, University of London.
  36. Cresswell, M. J. (2000). The role of public examinations in defining and monitoring standards. In Educational Standards (pp. 69-104). Oxford: Oxford University Press for the British Academy.
  37. Cresswell, M. J. (2010). Monitoring general qualification standards: A strategic view from AQA. Manchester: Assessment and Qualifications Alliance.
  38. de la Torre, J. (2009). Improving the quality of ability estimates through multidimensional scoring and incorporation of ancillary variables. Applied Psychological Measurement, 33(6), 465-485.
  39. Deci, E. L. (1975). Intrinsic motivation. New York: Plenum Press.
  40. Deci, E. L., Ryan, R. M. & Koestner, R. (2001). The pervasive negative effects of rewards on intrinsic motivation: Response to Cameron (2001). Review of Educational Research, 71(1), 43-51.
  41. Department for Education and Skills. (2006). Making Good Progress: How can we help every pupil to make good progress at school? Nottingham: DfES Publications.
  42. Dorans, N. J. (1990). Equating methods and sampling designs. Applied Measurement in Education, 3(1), 3.
  43. Dorans, N. J. & Holland, P. W. (2000). Population invariance and equatability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281-306.
  44. Dorans, N. J., Pommerich, M. & Holland, P. W. (Eds.). (2007). Linking and aligning scores and scales. New York: Springer.
  45. Drasgow, F. & Lissak, R. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68, 363-373.
  46. Eason, S. (2003). Cashing-in of curriculum 2000 AS and A-level results. Manchester: Assessment and Qualifications Alliance.
  47. Eason, S. (2007). GCE Information and Communication Technology (5521 / 6521): Conflict of unit standards between the January and June examinations series. Manchester: Assessment and Qualifications Alliance.
  48. Eason, S. (2008). Perceived conflict between GCE unit awarding outcomes from the January and June examinations series: A worked example based on AS Psychology B (5186). Manchester: Assessment and Qualifications Alliance.
  49. Eason, S. (2009). GCSE Sciences: Candidates’ unit-entry behaviour and the impact on overall subject awards – June 2008 and June 2009. Manchester: Assessment and Qualifications Alliance.
  50. Eason, S. (2010). Predicting GCSE outcomes based on candidates' prior achieved Key Stage 2 results. Manchester: Assessment and Qualifications Alliance.
  51. Ecclestone, K. (2006). Assessment in post-14 education: The implications of principles, practices and politics for learning and achievement (No. 2). The Nuffield Review of 14-19 Education. The Nuffield Foundation. Retrieved from http://www.nuffield14-19review.org.uk/files/documents125-1.pdf
  52. Edexcel. Mathematics (2381) Modular. Retrieved 3 August, 2009.
  53. Educational Testing Service. (2009). GRE Details: Test Takers. Educational Testing Service.
  54. Eignor, D. R., Stocking, M. L. & Cook, L. L. (1990). Simulation results of effects on linear and curvilinear observed- and true-score equating procedures of matching on a fallible criterion. Applied Measurement in Education, 3(1), 37-52.
  55. Engineering Council. (2000). Measuring the Mathematics Problem. London: The Engineering Council.
  56. Fawcett, J. (2005). Criteria for evaluation of theory. Nursing Science Quarterly, 18, 131-135.
  57. Feyerabend, P. (1988). Against Method (Rev. ed.). London: Verso.
  58. Fischer, G. H. (2007). Rasch models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 515-585). Amsterdam: Elsevier.
  59. Fowles, D. (2009). A concurrent approach to estimating the reliability of electronic marking of long form answers. Manchester: Assessment and Qualifications Alliance.
  60. Fox, J. & Wyrick, C. (2008). A mixed effects randomized item response model. Journal of Educational and Behavioral Statistics, 33(4), 389-415.
  61. Frantz, D. & Nordheimer, J. (1997, September 28). Giant of exam business keeps quiet on cheating. New York Times.
  62. Gilbert, C. (2006). 2020 Vision: Report of the teaching and learning in 2020 review group. Department for Education and Skills. Nottingham: DfES Publications.
  63. Glas, C. & Falcon, J. C. S. (2003). A comparison of item-fit statistics for the three parameter logistic model. Applied Psychological Measurement, 27(2), 87-106.
  64. Good, F. & Cresswell, M. J. (1987). Grade awarding judgements in differentiated examinations. Manchester: Assessment and Qualifications Alliance.
  65. Good, F. & Cresswell, M. J. (1988a). Differentiated assessment: Grading and related issues. London: The Secondary Examinations Council.
  66. Good, F. & Cresswell, M. J. (1988b). Grading the GCSE. London: Secondary Examinations Council.
  67. Green, B. F. J. (1983). Notes on the efficacy of tailored tests. In H. Wainer & S. Messick (Eds.), Principals of Modern Psychological Measurement. Hillsdale, NJ: Lawrence Erlbaum Associates.
  68. Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society: Series B, 29(1), 83-100.
  69. Hambleton, R. K., Swaminathan, H. & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage.
  70. Hanson, B. A. & Béguin, A. A. (1999). Separate versus concurrent estimation of IRT item parameters in the common item equating design (ACT Research Report Series). Iowa City, IA: ACT.
  71. Harker, R. & Tymms, P. (2004). The effects of student composition on school outcomes. School Effectiveness and School Improvement, 15(2), 177-199.
  72. Hitchcock, C. & Sober, E. (2004). Prediction versus accommodation and the risk of overfitting. The British Journal for the Philosophy of Science, 55(1), 1-34.
  73. Holland, P. W., Dorans, N. J. & Petersen, N. (2007). Equating test scores. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 169-203). Amsterdam: Elsevier.
  74. Ireson, J., Hallam, S. & Hurley, C. (2005). What are the effects of ability grouping on GCSE attainment? British Educational Research Journal, 31(4), 443-458.
  75. Jones, B. (2002). Clerical errors in marking - Manchester office - year 2001 summer examinations. Manchester: Assessment and Qualifications Alliance.
  76. Jones, B. (2005). Analysis of predicted outcomes for six GCE science units in the January and June 2004 examination series. Manchester: Assessment and Qualifications Alliance.
  77. Jones, B. (2008). Statistical predictions for GCE new specification AS units in January 2009: A discussion paper. Manchester: Assessment and Qualifications Alliance.
  78. Jones, B. (2009a). Awarding GCSE and GCE - time to reform the Code of Practice? Manchester: Assessment and Qualifications Alliance.
  79. Jones, B. (2009b). Setting standards in the new GCE specification AS and A2 units in January 2010. Manchester: Assessment and Qualifications Alliance.
  80. Kiefer, J. & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-903.
  81. Kim, J. & Bolt, D. M. (2007). An NCME instructional module on estimating Item Response Theory models using Markov Chain Monte Carlo methods. Educational Measurement: Issues and Practice, 26(4), 38-51.
  82. Kolen, M. J. (1990). Does matching in equating work? A discussion. Applied Measurement in Education, 3(1), 97-104.
  83. Kolen, M. J. & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and Practices (2nd ed.). New York: Springer.
  84. Laming, D. (2004). Human judgment. Cengage Learning EMEA.
  85. Lawrence, I. M. & Dorans, N. J. (1990). Effect on equating results of matching samples on an anchor test. Applied Measurement in Education, 3(1), 19-36.
  86. Linacre, J. M. (1994). Sample size and item calibration (or Person Measure) stability. Rasch Measurement Transactions, 7(4), 328.
  87. Linacre, J. M. (2004a). Equating constants with mixed item types. Rasch Measurement Transactions, 18(3), 992.
  88. Linacre, J. M. (2004b). Rasch model estimation: Further topics. In E. V. Smith Jr. & R. M. Smith (Eds.), Introduction to Rasch measurement. Maple Grove, Minnesota: JAM Press.
  89. Linacre, J. M. (2008). A user's guide to WINSTEPS® MINISTEP: Rasch-Model Computer Programs (Program Manual 3.66.0.).
  90. Liu, J., Harris, D. & Schmidt, A. E. (2007). Statistical procedures used in college admissions testing. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 1057-1091). Amsterdam: Elsevier.
  91. Livingston, S. A. (2004). Equating test scores (without IRT). Princeton, NJ: Educational Testing Service.
  92. Livingston, S. A., Dorans, N. J. & Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3(1), 73-95.
  93. Lord, F. (1980). Applications of Item Response Theory to practical testing problems. Hillsdale, NJ: Erlbaum.
  94. Lord, F. & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
  95. Lord, F. & Wingersky, M. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement, 8, 452-461.
  96. Luecht, R., Brumfield, T. & Breithaupt, K. (2006). A testlet assembly design for adaptive multi-stage tests. Applied Measurement in Education, 19(3), 189-202.
  97. Lundgren-Nilsson, Å., Tennant, A., Grimby, G. & Sunnerhagen, K. (2006). Cross diagnostic validity in a generic instrument: An example from the functional independence measure in Scandinavia. Health and Quality of Life Outcomes, 4(55).
  98. Mair, P. & Hatzinger, R. (2007). Extended Rasch modeling: The eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9), 1-20.
  99. Mariano, L. T. & Junker, B. W. (2007). Covariates of the rating process in hierarchical models for multiple ratings of test items. Journal of Educational and Behavioral Statistics, 32(3), 287-314.
  100. Maris, G. & Bechger, T. (2007). Scoring open ended questions. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 663-681). Amsterdam: Elsevier.
  101. McLeod, L. D. & Schnipke, D. L. (1999, April). Detecting items that have been memorized in the computerized adaptive testing environment. Presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
  102. Mead, A. (2006). An introduction to multi-stage testing. Applied Measurement in Education, 19(3), 185-187.
  103. Meyer, L. (2009a). Principles of standard setting. Manchester: Assessment and Qualifications Alliance.
  104. Meyer, L. (2009b). Putting education policy into practice. Manchester: Assessment and Qualifications Alliance.
  105. Molenaar, I. W. (1983). Some improved diagnostics for failure in the Rasch model. Psychometrika, 48, 49-72.
  106. Moreno, K. & Segall, D. (1997). Reliability and construct validity of CAT-ASVAB. In W. A. Sands, B. K. Waters & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 169-174). Washington, DC: American Psychological Association.
  107. Mroch, A. A., Bolt, D. M. & Wollack, J. A. (2005). A new Multi-Class Mixture Rasch Model for test speededness. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
  108. Newton, P. (2005a). Examination standards and the limits of linking. Assessment in Education, 12, 105-123.
  109. Newton, P. (2005b). The public understanding of measurement inaccuracy. British Educational Research Journal, 31(4), 419-442.
  110. Newton, P. (2007). Clarifying the purposes of educational assessment. Assessment in Education: Principles, Policy and Practice, 14, 149-170.
  111. Newton, P. (2008). Comparability monitoring: Progress report. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 452-476). London: Qualifications and Curriculum Authority.
  112. Nietzsche, F. W. (trans. 2004). Human, all too human. Cambridge: Cambridge University Press.
  113. Noss, R., Goldstein, H. & Hoyles, C. (1989). Graded assessment and learning hierarchies in mathematics. British Educational Research Journal, 15(2), 109-120.
  114. Patz, R. J. & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342-366.
  115. Petersen, N. (2008). A discussion of population invariance of equating. Applied Psychological Measurement, 32(1), 98-101.
  116. Pinot de Moira, A. (2008). Statistical predictions in award meetings: How confident should we be? Manchester: AQA Centre for Education Research and Policy.
  117. Pinot de Moira, A. (2009a). The effects of maturity: Evidence from linear GCSE specifications. Manchester: AQA Centre for Education Research and Policy.
  118. Pinot de Moira, A. (2009b). Introduction of the new AS and A-level qualifications: Predictions for the winter 2009 awards. Manchester: AQA Centre for Education Research and Policy.
  119. Pinot de Moira, A. (2009c). Marking reliability & mark tolerances: Deriving business rules for the CMI+ marking of long form answer questions. Manchester: AQA Centre for Education Research and Policy.
  120. Poirier, D. J. (1988). Causal relationships and replicability. Journal of Econometrics, 39, 213-234.
  121. Pollitt, A. (1985). What makes exam questions difficult?: An analysis of 'O' grade questions and answers. Edinburgh: Scottish Academic Press.
  122. Pollitt, A., Ahmed, A. & Crisp, V. (2007). The demands of examination syllabuses and question papers. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 166-210). London: Qualifications and Curriculum Authority.
  123. Qualifications and Curriculum Authority. (2009). Code of Practice. London: Author.
  124. R Development Core Team. (2010). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org.
  125. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
  126. Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. The Danish Yearbook of Philosophy, 14, 58-94.
  127. Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.
  128. Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and Item Response Theory analyses. Journal of Statistical Software, 17(5), 1-25.
  129. Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151-1172.
  130. Ryan, M. (2010, March 24). Tories want traditional A-level to ‘restore confidence’. BBC.
  131. Scharaschkin, A. & Baird, J. (2000). The effects of consistency of performance on A-level examiners' judgements of standards. British Educational Research Journal, 26, 343-357.
  132. Schmeiser, C. (2004). Reaffirming our raison d'etre: The ACT assessment. Paper presented at the annual meeting of the American Psychological Association, Honolulu.
  133. Schmitt, A. P., Cook, L. L., Dorans, N. J. & Eignor, D. R. (1990). Sensitivity of equating results to different sampling strategies. Applied Measurement in Education, 3(1), 53.
  134. Sinharay, S. (2005). Assessing fit of unidimensional Item Response Theory models using a Bayesian approach. Journal of Educational Measurement, 42(4), 375-394.
  135. Sinharay, S., Johnson, M. S. & Stern, H. S. (2006). Posterior predictive assessment of Item Response Theory models. Applied Psychological Measurement, 30(4), 298-321.
  136. Skaggs, G. (1990). To match or not to match samples on ability for equating: A discussion of five articles. Applied Measurement in Education, 3(1), 105-113.
  137. Smith, R. M. (2004). Fit analysis in latent trait measurement models. In E. V. Smith Jr. & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 73-92). Maple Grove, MN: JAM Press.
  138. Smith, R. M., Schumacker, R. E. & Bush, M. J. (2000). Examining replication effects in Rasch fit statistics. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (pp. 303-318). Stamford, CT: Ablex.
  139. Spalding, V. (2009). GCSE Science A: The size and effect of ‘If at first you don't succeed, try, try, again'. Manchester: AQA Centre for Education Research and Policy.
  140. Spiegelhalter, D., Best, N., Carlin, B. & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64(4), 583-639.
  141. Spiegelhalter, D., Thomas, A., Best, N. & Lunn, D. (2003). WinBUGS User Manual (Version 1.4) [Computer manual]. Cambridge: MRC Biostatistics Unit, Institute of Public Health. Retrieved from http://www.mrcbsu.cam.ac.uk/bugs/winbugs/manual14.pdf
  142. Stewart, D. & Shamdasani, P. (1990). Focus groups: Theory and practice. Newbury Park, CA: Sage.
  143. Stringer, N. (2008, September). An appropriate role for professional judgement in maintaining standards in English general qualifications. Paper presented at the 34th annual conference of the International Association for Educational Assessment (IAEA), Cambridge, UK.
  144. Stringer, N. (2011). Setting and maintaining GCSE and GCE grading standards: the case for contextualised cohort-referencing. Research Papers in Education, 1-20.
  145. Sturtz, S., Ligges, U. & Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software, 12(3), 1-16.
  146. Swaminathan, H. & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational and Behavioral Statistics, 7(3), 175-191.
  147. Swaminathan, H., Hambleton, R. K. & Rogers, H. J. (2007). Assessing the fit of Item Response Theory models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 683-718). Amsterdam: Elsevier.
  148. Sykes, R. (2010). The Sir Richard Sykes Review.
  149. Tennant, A. & Pallant, J. F. (2006). Unidimensionality matters! (A Tale of Two Smiths?). Rasch Measurement Transactions, 20(1), 1048-51.
  150. Traub, R. (1983). A priori considerations in choosing an item response model. In R. K. Hambleton (Ed.), Applications of Item Response Theory. Vancouver: Educational Research Institute of British Columbia.
  151. Tymms, P. & Fitz-Gibbon, C. (2001). Standards, achievement and educational performance: A cause for celebration? In J. Furlong & R. Phillips (Eds.), Education, reform and the state: Twenty-five years of politics, policy and practice (pp. 156-173). London: RoutledgeFalmer.
  152. van Rijn, P., Verstralen, H. & Béguin, A. A. (2009). Classification accuracy of multiple-test based decisions using Item Response Theory. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
  153. Verhelst, N. & Glas, C. (1995). The One Parameter Logistic Model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.
  154. Wainer, H. (with Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., et al.). (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
  155. Wainer, H., Bradlow, E. T. & Wang, X. (2007). Testlet response theory and its applications. Cambridge: Cambridge University Press.
  156. Wheadon, C. & Béguin, A. A. (2010). Fears for tiers: Are candidates being appropriately rewarded for their performance in tiered examinations? Assessment in Education, 17(3), 287-300.
  157. Wheadon, C., Spalding, V. & Tremain, K. (2008). GCSE English A: Comparability between tiers. Manchester: AQA Centre for Education Research and Policy.
  158. Whitehouse, C. & Eason, S. (2007). Pseudo-aggregation for GCSE Science A (4461). Manchester: AQA Centre for Education Research and Policy.
  159. Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer.
  160. Wise, S. L., Plake, B. S. & Mitchell, J. V., Jr. (1990). Editor's Note. Applied Measurement in Education, 3(1), 1-2.
  161. Wollack, J. A., Youngsuk, S. & Bolt, D. M. (2007). Using the testlet model to mitigate test speededness effects. Presented at the annual meeting of the National Council on Measurement in Education, Chicago.
  162. Wright, B. D. & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
  163. Wright, B. D. & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
  164. Wright, B. D. & Stone, M. H. (1999). Measurement Essentials (2nd ed.). Wilmington, DE: Wide Range.
  165. Yamamoto, K. & Everson, H. (1997). Modeling the effects of test length and test time on parameter estimation using the HYBRID model. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 89-98). New York: Waxmann.
  166. Yi, Q., Harris, D. & Gao, X. (2008). Invariance of equating functions across different subgroups of examinees taking a science achievement test. Applied Psychological Measurement, 32(1), 62-80.
  167. Zeng, L. & Kolen, M. J. (1995). An alternative approach for IRT observed-score equating of number-correct scores. Applied Psychological Measurement, 19, 231-240.
  168. Zwick, R. (1991). Effects of item order and context on estimation of NAEP reading proficiency. Educational Measurement: Issues and Practice, 10(3), 10-16.