An Item Response Theory approach to the maintenance of standards in public examinations in England
- Agresti, A. (2002). Categorical data analysis. Hoboken, New Jersey: John Wiley & Sons.
- Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.
- Albert, J. & Ghosh, M. (2000). Item response modeling. In D. K. Dey, S. K. Ghosh & B. K. Mallick (Eds.), Generalized linear models: A Bayesian perspective (pp. 173-193). New York: Marcel Dekker.
- Alberts, R. V. J. (2001). Equating exams as a prerequisite for maintaining standards: Experience with Dutch centralised secondary examinations. Assessment in Education: Principles, Policy & Practice, 8(3), 353-367.
- Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? In E. V. Smith Jr. & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 143-166). Maple Grove, MN: JAM Press.
- Assessment and Qualifications Alliance. (2008). Mathematics 4301 Specification A 2008. Manchester: Assessment and Qualifications Alliance. Retrieved from http://store.aqa.org.uk/qual/pdf/AQA-4301-W-SP-08.PDF.
- Baird, J. (2000). Are examination standards all in the head? Experiments with examiners' judgements of standards in A-level examinations. Research in Education, 64, 91-100.
- Baird, J. (2007). Alternative conceptions of comparability. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 124-165). London: Qualifications and Curriculum Authority.
- Baird, J., Cresswell, M. J. & Newton, P. (2000). Would the real gold standard please step forward? Research Papers in Education, 15(2), 213-229.
- Baird, J. & Dhillon, D. (2005). Qualitative expert judgements on examination standards: Valid, but inexact. Manchester: Assessment and Qualifications Alliance.
- Baird, J., Fearnley, A., Fowles, D., Jones, B., Morfidi, E. & White, D. (2001). Tiering in the GCSE: A study undertaken by AQA on behalf of the Joint Council for General Qualifications. Joint Council for General Qualifications.
- Beaton, A. E. & Zwick, R. (1990). The effect of changes in the National Assessment: Disentangling the NAEP 1985-86 Reading Anomaly (Revised). National Assessment of Educational Progress, Educational Testing Service, Princeton, NJ.
- Béguin, A. A. (2000). Robustness of equating high-stakes tests (Doctoral dissertation). University of Twente, Enschede, Netherlands. Retrieved from http://cito.nl/share/poc/dissertaties/dissertationbeguin2000.pdf.
- Béguin, A. A. & Glas, C. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541-561.
- Béguin, A. A., Wheadon, C., Meadows, M. & Eggen, T. (2007, November). Comparability of high-stakes assessments: the role of standard setting. Paper presented at the 8th annual conference of the Association for Educational Assessment (AEA) Europe, Stockholm.
- Binks, J. (2002). Official Response to the Science and Technology Parliamentary Committee Inquiry: Science Education from 14-19. London: Confederation of British Industry.
- Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. Lord & M. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- Black, B. & Bramley, T. (2008). Investigating a judgemental rank-ordering method for maintaining standards in UK examinations. Research Papers in Education, 23(3), 357-373.
- Black, P. (2007, May). Can we design a supportive assessment system? Paper presented at the Chartered Institute of Educational Assessors, London.
- Black, P., Harrison, C., Lee, C., Marshall, B. & Wiliam, D. (2003). Assessment for learning: Putting it into practice. Maidenhead, UK: Open University Press.
- Black, P. & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139-148.
- Bock, R. D. & Moustaki, I. (2007). Item response theory in a general framework. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 469-513). Amsterdam: Elsevier.
- Bolt, D. M., Cohen, A. S. & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, 26(4), 381-409.
- Bolt, D. M., Cohen, A. S. & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39(4), 331-348.
- Brennan, R. L. (2008). A discussion of population invariance. Applied Psychological Measurement, 32(1), 102-114.
- Brown, M. (1989). Graded assessment and learning hierarchies in mathematics: An alternative view. British Educational Research Journal, 15(2), 121-128.
- Cameron, J. (2001). Negative effects of reward on intrinsic motivation - A limited phenomenon: Comment on Deci, Koestner, and Ryan (2001). Review of Educational Research, 71(1), 29-42.
- Charmaz, K. (2006). Constructing grounded theory. London: Sage.
- Chen, W. & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
- Christie, T. & Forrest, G. M. (1980). Standards at GCE A-level: 1963 and 1973: A pilot investigation of examination standards in three subjects. Basingstoke: Macmillan Education.
- Cockcroft, W. (1982). The Cockcroft Report (1982): Mathematics counts. London: Her Majesty's Stationery Office.
- Coe, R. (2007). Common Examinee Methods. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 331-367). London: Qualifications and Curriculum Authority.
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: L. Erlbaum Associates.
- Cook, L. L. & Petersen, N. S. (1987). Problems related to the use of conventional and Item Response Theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11(3), 225-244.
- Cresswell, M. J. (1997). Examining judgements: Theory and practice of awarding public examination grades (Doctoral dissertation). London: Institute of Education, University of London.
- Cresswell, M. J. (2000). The role of public examinations in defining and monitoring standards. In Educational Standards (pp. 69-104). Oxford: Oxford University Press for the British Academy.
- Cresswell, M. J. (2010). Monitoring general qualification standards: A strategic view from AQA. Manchester: Assessment and Qualifications Alliance.
- de la Torre, J. (2009). Improving the quality of ability estimates through multidimensional scoring and incorporation of ancillary variables. Applied Psychological Measurement, 33(6), 465-485.
- Deci, E. L. (1975). Intrinsic motivation. New York: Plenum Press.
- Deci, E. L., Koestner, R. & Ryan, R. M. (2001). The pervasive negative effects of rewards on intrinsic motivation: Response to Cameron (2001). Review of Educational Research, 71(1), 43-51.
- Department for Education and Skills. (2006). Making Good Progress: How can we help every pupil to make good progress at school? Nottingham: DfES Publications.
- Dorans, N. J. (1990). Equating methods and sampling designs. Applied Measurement in Education, 3(1), 3.
- Dorans, N. J. & Holland, P. W. (2000). Population invariance and equatability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281-306.
- Dorans, N. J., Pommerich, M. & Holland, P. W. (Eds.). (2007). Linking and aligning scores and scales. New York: Springer.
- Drasgow, F. & Lissak, R. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68, 363-373.
- Eason, S. (2003). Cashing-in of curriculum 2000 AS and A-level results. Manchester: Assessment and Qualifications Alliance.
- Eason, S. (2007). GCE Information and Communication Technology (5521 / 6521): Conflict of unit standards between the January and June examinations series. Manchester: Assessment and Qualifications Alliance.
- Eason, S. (2008). Perceived conflict between GCE unit awarding outcomes from the January and June examinations series: A worked example based on AS Psychology B (5186). Manchester: Assessment and Qualifications Alliance.
- Eason, S. (2009). GCSE Sciences: Candidates’ unit-entry behaviour and the impact on overall subject awards – June 2008 and June 2009. Manchester: Assessment and Qualifications Alliance.
- Eason, S. (2010). Predicting GCSE outcomes based on candidates' prior achieved Key Stage 2 results. Manchester: Assessment and Qualifications Alliance.
- Ecclestone, K. (2006). Assessment in post-14 education: The implications of principles, practices and politics for learning and achievement (No. 2). The Nuffield Review of 14-19 Education. The Nuffield Foundation. Retrieved from http://www.nuffield14-19review.org.uk/files/documents125-1.pdf
- Edexcel. (n.d.). Mathematics (2381) Modular. Retrieved 3 August 2009.
- Educational Testing Service. (2009). GRE Details: Test Takers. Princeton, NJ: Author.
- Eignor, D. R., Stocking, M. L. & Cook, L. L. (1990). Simulation results of effects on linear and curvilinear observed- and true-score equating procedures of matching on a fallible criterion. Applied Measurement in Education, 3(1), 37-52.
- Engineering Council (2000). Measuring the Mathematics Problem. London: The Engineering Council.
- Fawcett, J. (2005). Criteria for evaluation of theory. Nursing Science Quarterly, 18, 131-135.
- Feyerabend, P. (1988). Against Method (Rev. ed.). Verso: London/New York.
- Fischer, G. H. (2007). Rasch models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 515-585). Amsterdam: Elsevier.
- Fowles, D. (2009). A concurrent approach to estimating the reliability of electronic marking of long form answers. Manchester: Assessment and Qualifications Alliance.
- Fox, J. & Wyrick, C. (2008). A mixed effects randomized item response model. Journal of Educational and Behavioral Statistics, 33(4), 389-415.
- Frantz, D. & Nordheimer, J. (1997, September 28). Giant of exam business keeps quiet on cheating. New York Times.
- Gilbert, C. (2006). 2020 Vision: Report of the teaching and learning in 2020 review group. Department for Education and Skills. Nottingham: DfES Publications.
- Glas, C. & Falcon, J. C. S. (2003). A comparison of item-fit statistics for the three parameter logistic model. Applied Psychological Measurement, 27(2), 87-106.
- Good, F. & Cresswell, M. J. (1987). Grade awarding judgements in differentiated examinations. Manchester: Assessment and Qualifications Alliance.
- Good, F. & Cresswell, M. J. (1988a). Differentiated assessment: Grading and related issues. London: The Secondary Examinations Council.
- Good, F. & Cresswell, M. J. (1988b). Grading the GCSE. London: Secondary Examinations Council.
- Green, B. F. J. (1983). Notes on the efficacy of tailored tests. In H. Wainer & S. Messick (Eds.), Principles of Modern Psychological Measurement. Hillsdale, NJ: Lawrence Erlbaum Associates.
- Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society: Series B, 29(1), 83-100.
- Hambleton, R. K., Swaminathan, H. & Rogers, J. H. (1991). Fundamentals of Item Response Theory. Newbury Park, California: Sage.
- Hanson, B. A. & Béguin, A. A. (1999). Separate versus concurrent estimation of IRT item parameters in the common item equating design. ACT Research Report Series. Iowa City, IA: ACT.
- Harker, R. & Tymms, P. (2004). The effects of student composition on school outcomes. School Effectiveness and School Improvement, 15(2), 177-199.
- Hitchcock, C. & Sober, E. (2004). Prediction versus accommodation and the risk of overfitting. The British Journal for the Philosophy of Science, 55(1), 1-34.
- Holland, P. W., Dorans, N. J. & Petersen, N. (2007). Equating test scores. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 169-203). Amsterdam: Elsevier.
- Ireson, J., Hallam, S. & Hurley, C. (2005). What are the effects of ability grouping on GCSE attainment? British Educational Research Journal, 31(4), 443-458.
- Jones, B. (2002). Clerical errors in marking - Manchester office - year 2001 summer examinations. Manchester: Assessment and Qualifications Alliance.
- Jones, B. (2005). Analysis of predicted outcomes for six GCE science units in the January and June 2004 examination series. Manchester: Assessment and Qualifications Alliance.
- Jones, B. (2008). Statistical predictions for GCE new specification AS units in January 2009: A discussion paper. Manchester: Assessment and Qualifications Alliance.
- Jones, B. (2009a). Awarding GCSE and GCE - time to reform the Code of Practice? Manchester: Assessment and Qualifications Alliance.
- Jones, B. (2009b). Setting standards in the new GCE specification AS and A2 units in January 2010. Manchester: Assessment and Qualifications Alliance.
- Kiefer, J. & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-903.
- Kim, J. & Bolt, D. M. (2007). An NCME instructional module on estimating Item Response Theory models using Markov Chain Monte Carlo methods. Educational Measurement: Issues and Practice, 26(4), 38-51.
- Kolen, M. J. (1990). Does matching in equating work: A Discussion. Applied Measurement in Education, 3(1), 97-104.
- Kolen, M. J. & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and Practices (2nd ed.). New York: Springer.
- Laming, D. (2004). Human judgment. Cengage Learning EMEA.
- Lawrence, I. M. & Dorans, N. J. (1990). Effect on equating results of matching samples on an anchor test. Applied Measurement in Education, 3(1), 19-36.
- Linacre, J. M. (1994). Sample size and item calibration (or Person Measure) stability. Rasch Measurement Transactions, 7(4), 328.
- Linacre, J. M. (2004a). Equating constants with mixed item types. Rasch Measurement Transactions, 18(3), 992.
- Linacre, J. M. (2004b). Rasch model estimation: Further topics. In E. V. Smith Jr. & R. M. Smith (Eds.), Introduction to Rasch measurement. Maple Grove, Minnesota: JAM Press.
- Linacre, J. M. (2008). A user's guide to WINSTEPS® MINISTEP: Rasch-model computer programs (Version 3.66.0) [Program manual].
- Liu, J., Harris, D. & Schmidt, A. E. (2007). Statistical procedures used in college admissions testing. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 1057-1091). Amsterdam: Elsevier.
- Livingston, S. A. (2004). Equating test scores (without IRT). Princeton, NJ: Educational Testing Service.
- Livingston, S. A., Dorans, N. J. & Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3(1), 73-95.
- Lord, F. (1980). Applications of Item Response Theory to practical testing problems. Hillsdale, NJ: Erlbaum.
- Lord, F. & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- Lord, F. & Wingersky, M. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement, 8, 452-461.
- Luecht, R., Brumfield, T. & Breithaupt, K. (2006). A testlet assembly design for adaptive multi-stage tests. Applied Measurement in Education, 19(3), 189-202.
- Lundgren-Nilsson, Å., Tennant, A., Grimby, G. & Sunnerhagen, K. (2006). Cross diagnostic validity in a generic instrument: An example from the functional independence measure in Scandinavia. Health and Quality of Life Outcomes, 4(55).
- Mair, P. & Hatzinger, R. (2007). Extended Rasch modeling: The eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9), 1-20.
- Mariano, L. T. & Junker, B. W. (2007). Covariates of the rating process in hierarchical models for multiple ratings of test items. Journal of Educational and Behavioral Statistics, 32(3), 287-314.
- Maris, G. & Bechger, T. (2007). Scoring open ended questions. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 663-681). Amsterdam: Elsevier.
- McLeod, L. D. & Schnipke, D. L. (1999, April). Detecting items that have been memorized in the computerized adaptive testing environment. Presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
- Mead, A. (2006). An introduction to multi-stage testing. Applied Measurement in Education, 19(3), 185-187.
- Meyer, L. (2009a). Principles of standard setting. Manchester: Assessment and Qualifications Alliance.
- Meyer, L. (2009b). Putting education policy into practice. Manchester: Assessment and Qualifications Alliance.
- Molenaar, I. W. (1983). Some improved diagnostics for failure in the Rasch model. Psychometrika, 48, 49-72.
- Moreno, K. & Segall, D. (1997). Reliability and construct validity of CAT-ASVAB. In W. A. Sands, B. K. Waters & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 169-174). Washington, DC: American Psychological Association.
- Mroch, A. A., Bolt, D. M. & Wollack, J. A. (2005). A new Multi-Class Mixture Rasch Model for test speededness. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
- Newton, P. (2005a). Examination standards and the limits of linking. Assessment in Education, 12, 105-123.
- Newton, P. (2005b). The public understanding of measurement inaccuracy. British Educational Research Journal, 31(4), 419-442.
- Newton, P. (2007). Clarifying the purposes of educational assessment. Assessment in Education: Principles, Policy and Practice, 14, 149-170.
- Newton, P. (2008). Comparability monitoring: Progress report. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 452-476). London: Qualifications and Curriculum Authority.
- Nietzsche, F. W. (trans. 2004). Human, all too human. Cambridge: Cambridge University Press.
- Noss, R., Goldstein, H. & Hoyles, C. (1989). Graded assessment and learning hierarchies in mathematics. British Educational Research Journal, 15(2), 109-120.
- Patz, R. J. & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342-366.
- Petersen, N. (2008). A discussion of population invariance of equating. Applied Psychological Measurement, 32(1), 98-101.
- Pinot de Moira, A. (2008). Statistical predictions in award meetings: How confident should we be? Manchester: AQA Centre for Education Research and Policy.
- Pinot de Moira, A. (2009a). The effects of maturity: Evidence from linear GCSE specifications. Manchester: AQA Centre for Education Research and Policy.
- Pinot de Moira, A. (2009b). Introduction of the new AS and A-level qualifications: Predictions for the winter 2009 awards. Manchester: AQA Centre for Education Research and Policy.
- Pinot de Moira, A. (2009c). Marking reliability & mark tolerances: Deriving business rules for the CMI+ marking of long form answer questions. Manchester: AQA Centre for Education Research and Policy.
- Poirier, D. J. (1988). Causal relationships and replicability. Journal of Econometrics, 39, 213-234.
- Pollitt, A. (1985). What makes exam questions difficult?: An analysis of 'O' grade questions and answers. Edinburgh: Scottish Academic Press.
- Pollitt, A., Ahmed, A. & Crisp, V. (2007). The demands of examination syllabuses and question papers. In P. Newton, J. Baird, H. Goldstein, H. Patrick & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 166-210) London: Qualifications and Curriculum Authority.
- Qualifications and Curriculum Authority. (2009). Code of Practice. London: Author.
- R Development Core Team. (2010). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org.
- Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
- Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. The Danish Yearbook of Philosophy, 14, 58-94.
- Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.
- Rizopoulos, D. (2006). An R package for latent variable modelling and Item Response Theory analyses. Journal of Statistical Software, 17(5), 1-25.
- Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151-1172.
- Ryan, M. (2010, March 24). Tories want traditional A-level to ‘restore confidence’. BBC.
- Scharaschkin, A. & Baird, J. (2000). The effects of consistency of performance on A-level examiners' judgements of standards. British Educational Research Journal, 26, 343-357.
- Schmeiser, C. (2004). Reaffirming our raison d'etre: The ACT assessment. Paper presented at the annual meeting of the American Psychological Association, Honolulu.
- Schmitt, A. P., Cook, L. L., Dorans, N. J. & Eignor, D. R. (1990). Sensitivity of equating results to different sampling strategies. Applied Measurement in Education, 3(1), 53.
- Sinharay, S. (2005). Assessing fit of unidimensional Item Response Theory models using a Bayesian approach. Journal of Educational Measurement, 42(4), 375-394.
- Sinharay, S., Johnson, M. S. & Stern, H. S. (2006). Posterior predictive assessment of Item Response Theory models. Applied Psychological Measurement, 30(4), 298-321.
- Skaggs, G. (1990). To match or not to match samples on ability for equating: A discussion of five articles. Applied Measurement in Education, 3(1), 105-113.
- Smith, R. M. (2004). Fit analysis in latent trait measurement models. In E. V. Smith Jr. & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 73-92). Maple Grove, MN: JAM Press.
- Smith, R. M., Schumacker, R. E. & Bush, M. J. (2000). Examining replication effects in Rasch fit statistics. In M. Wilson & G. Engelhard Jr. (Eds.), Objective measurement: Theory into practice (pp. 303-318). Stamford, CT: Ablex.
- Spalding, V. (2009). GCSE Science A: The size and effect of ‘If at first you don't succeed, try, try, again'. Manchester: AQA Centre for Education Research and Policy.
- Spiegelhalter, D., Best, N., Carlin, B. & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64(4), 583-639.
- Spiegelhalter, D., Thomas, A., Best, N. & Lunn, D. (2003). WinBUGS User Manual (Version 1.4) [Computer manual]. Cambridge: MRC Biostatistics Unit, Institute of Public Health. Retrieved from http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/manual14.pdf
- Stewart, D. & Shamdasani, P. (1990). Focus groups: Theory and practice. Newbury Park, CA: Sage.
- Stringer, N. (2008, September). An appropriate role for professional judgement in maintaining standards in English general qualifications. Paper presented at the 34th annual conference of the International Association for Educational Assessment (IAEA), Cambridge, UK.
- Stringer, N. (2011). Setting and maintaining GCSE and GCE grading standards: The case for contextualised cohort-referencing. Research Papers in Education, 1-20.
- Sturtz, S., Ligges, U. & Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software, 12(3), 1-16.
- Swaminathan, H. & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational and Behavioral Statistics, 7(3), 175-191.
- Swaminathan, H., Hambleton, R. K. & Rogers, H. J. (2007). Assessing the fit of Item Response Theory models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Vol. 26 (pp. 683-718). Amsterdam: Elsevier.
- Sykes, R. (2010). The Sir Richard Sykes Review.
- Tennant, A. & Pallant, J. F. (2006). Unidimensionality matters! (A tale of two Smiths?). Rasch Measurement Transactions, 20(1), 1048-1051.
- Traub, R. (1983). A priori considerations in choosing an item response model. In Applications of Item Response Theory. Vancouver: Educational Research Institute of British Columbia.
- Tymms, P. & Fitz-Gibbon, C. (2001). Standards, achievement and educational performance: A cause for celebration? In J. Furlong & R. Phillips (Eds.), Education, reform and the state: Twenty-five years of politics, policy and practice (pp. 156-173). London: RoutledgeFalmer.
- van Rijn, P., Verstralen, H. & Béguin, A. A. (2009). Classification accuracy of multiple-test based decisions using Item Response Theory. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
- Verhelst, N. & Glas, C. (1995). The One Parameter Logistic Model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.
- Wainer, H. (with Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., et al.). (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
- Wainer, H., Bradlow, E. T. & Wang, X. (2007). Testlet response theory and its applications. Cambridge: Cambridge University Press.
- Wheadon, C. & Béguin, A. A. (2010). Fears for tiers: Are candidates being appropriately rewarded for their performance in tiered examinations? Assessment in Education, 17(3), 287-300.
- Wheadon, C., Spalding, V. & Tremain, K. (2008). GCSE English A: Comparability between tiers. Manchester: AQA Centre for Education Research and Policy.
- Whitehouse, C. & Eason, S. (2007). Pseudo-aggregation for GCSE Science A (4461). Manchester: AQA Centre for Education Research and Policy.
- Wickham, H. (2009). ggplot2: elegant graphics for data analysis. New York: Springer.
- Wise, S. L., Plake, B. S. & Mitchell, J. V., Jr. (1990). Editor's Note. Applied Measurement in Education, 3(1), 1-2.
- Wollack, J. A., Suh, Y. & Bolt, D. M. (2007). Using the testlet model to mitigate test speededness effects. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
- Wright, B. D. & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
- Wright, B. D. & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
- Wright, B. D. & Stone, M. H. (1999). Measurement essentials (2nd ed.). Wilmington, DE: Wide Range.
- Yamamoto, K. & Everson, H. (1997). Modeling the effects of test length and test time on parameter estimation using the HYBRID model. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 89-98). New York: Waxmann.
- Yi, Q., Harris, D. & Gao, X. (2008). Invariance of equating functions across different subgroups of examinees taking a science achievement test. Applied Psychological Measurement, 32(1), 62-80.
- Zeng, L. & Kolen, M. J. (1995). An alternative approach for IRT observed-score equating of number-correct scores. Applied Psychological Measurement, 19, 231-240.
- Zwick, R. (1991). Effects of item order and context on estimation of NAEP reading proficiency. Educational Measurement: Issues and Practice, 10(3), 10-16.