Interrater reliability in large-scale assessments: can teachers score national tests reliably without external controls?
Umeå University, Faculty of Social Sciences, Department of Applied Educational Science, Department of Educational Measurement.
2015 (English). In: Practical Assessment, Research & Evaluation, ISSN 1531-7714, E-ISSN 1531-7714, Vol. 20, no 9. Article in journal (Refereed). Published.
Abstract [en]

Most large-scale assessment systems implement a set of rather expensive external quality controls in order to guarantee interrater reliability. This study empirically examines whether teachers’ ratings of national tests in mathematics can be reliable without monitoring, training, or other methods of external quality assurance. A sample of 99 booklets of students’ answers to a national test in mathematics was scored independently by five teachers. The interrater reliability was analyzed using consensus and consistency estimates, focusing on the test as a whole as well as on individual items. The results show that the estimates are acceptable and in many cases fairly high, irrespective of the reliability measure used. Some plausible explanations for lower interrater reliability on individual items are discussed, and some suggestions are made for further improving reliability without imposing any system of control.

Place, publisher, year, edition, pages
2015. Vol. 20, no 9
National Category
Pedagogical Work
Identifiers
URN: urn:nbn:se:umu:diva-101511
OAI: oai:DiVA.org:umu-101511
DiVA, id: diva2:799871
Available from: 2015-03-31 Created: 2015-03-31 Last updated: 2018-11-05 Bibliographically approved
In thesis
1. Dimensions of validity: studies of the Swedish national tests in mathematics
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title[sv]
Aspekter av validitet : studier av de Svenska nationella proven i matematik
Abstract [en]

The main purpose of the Swedish national tests was originally to provide exemplary assessments in a subject and to support teachers when interpreting the syllabus. Today, their main purpose is to provide an important basis for teachers when grading their students. Although the test results do not entirely decide a student’s grade, they are to be taken into special account in the grading process. Given the increasing importance of the tests and the rising stakes, quality issues in terms of validity and reliability are attracting greater attention. The main purpose of this thesis is to examine evidence of validity for the Swedish national tests in upper secondary school mathematics and thereby identify potential threats to validity that may affect the interpretations of the test results and lead to invalid conclusions. The validation is made in relation to the purpose that the national tests should support fair and equal assessment and grading. More specifically, the focus was to investigate how differences connected to digital tools, different scorers and the standard setting process affect the results, and also to investigate whether subscores can be used when interpreting the results. A model visualized as a chain containing links associated with various aspects of validity, ranging from administration and scoring to interpretation and decision-making, is used as a framework for the validation.

The thesis consists of four empirical studies presented in the form of papers and an introduction with summaries of the papers. Different parts of the validation chain are examined in the studies. The focus of the first study is the administration of the tests and the impact of using advanced calculators when answering test items. These calculators are able to solve equations algebraically and therefore reduce the risk of a student making mistakes. Because the use of such calculators is allowed but not required, and because they are quite expensive, there is an obvious threat to validity: the national tests are supposed to be fair and equal for all test takers. The results show that the advanced calculators were not used to a great extent and that it was mainly students who were high-achieving in mathematics who benefited the most. Therefore the conclusion was that the calculators did not affect the results.

The second study was an inter-rater reliability study. In Sweden, teachers are responsible for scoring their own students’ national tests, without any training, monitoring or moderation. Therefore it was interesting to investigate the reliability of the scoring since there is a potential risk of bias against one’s own students. The analyses showed that the agreement between different raters, analyzed with percent-agreement and kappa, is rather high but some items have lower agreement. In general, items with several correct answers or items where different solution strategies are available are more difficult to score reliably.
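The two agreement measures named above can be illustrated with a minimal sketch. It assumes the simplest case of two raters scoring the same items and uses Cohen's kappa; the abstract does not say which kappa variant was used for the five raters, and the scores below are invented for illustration only.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which two raters gave exactly the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same category
    # if each rated independently according to their own marginal frequencies.
    categories = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if p_expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical scores (0, 1 or 2 points) from two raters on ten answers.
rater_1 = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0]
rater_2 = [0, 1, 1, 2, 0, 0, 2, 1, 1, 0]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohen_kappa(rater_1, rater_2)
print(f"percent agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Kappa is lower than the raw percent agreement here because part of the observed agreement would be expected by chance alone, which is why the two measures are usually reported together.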

In study three, the cut scores set by a judgmental Angoff standard setting, the method used to define the cut scores for the national tests in mathematics, were compared with a statistical linking procedure using an anchor test design in order to investigate whether the cut scores for two test forms were equally demanding. The results indicate that there were no large differences between the test forms. However, one of the test taker groups was rather small, which restricts the power of the analysis. The national tests do not include any anchor items, and the study highlights the challenges of introducing equating, that is, comparing the difficulty of different test forms, on a regular basis.
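As a rough illustration of how a judgmental Angoff cut score is obtained, the sketch below uses the basic textbook form of the method: each judge estimates, for every item, the probability that a minimally competent test taker answers correctly, and the cut score is the mean over judges of the summed estimates. The abstract does not describe which Angoff variant was used for the national tests, so the function name, the one-point-per-item default and the ratings are all assumptions made for illustration.

```python
from statistics import fmean


def angoff_cut_score(ratings, max_points=None):
    """Basic Angoff cut score.

    ratings[j][i] is judge j's estimated probability that a minimally
    competent test taker answers item i correctly. max_points[i] is the
    maximum score on item i (assumed to be 1 point per item if omitted).
    """
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    if max_points is None:
        max_points = [1] * n_items
    # Expected total score of a borderline test taker, according to each judge.
    per_judge = [
        sum(p * points for p, points in zip(judge, max_points))
        for judge in ratings
    ]
    return fmean(per_judge)


# Hypothetical ratings from three judges on a four-item test.
ratings = [
    [0.5, 0.75, 0.25, 1.0],
    [0.5, 0.5, 0.25, 0.75],
    [0.5, 1.0, 0.25, 0.5],
]
cut = angoff_cut_score(ratings)
print(f"cut score: {cut:.2f} of 4 points")
```

Because the result depends entirely on the judges' estimates, two test forms rated by different panels can end up with cut scores that are not equally demanding, which is the question a statistical linking procedure is meant to check.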

In study four, the focus was on subscores and whether there was any value in reporting them in addition to the total score. The syllabus in mathematics has been competence-based since 2011 and the items in the national tests are categorized in relation to these competencies. The test grades are only connected to the total score via the cut scores but the result for each student is consolidated in a result profile based on those competencies. The subscore analysis shows that none of the subscores have added value and the tests would have to be up to four times longer in order to achieve any significant value.

In conclusion, the studies indicate that several of the potential threats do not appear to be significant and the evidence suggests that the interpretations made and decisions taken have the potential to be valid. However, there is a need for further studies. In particular, there is a need to develop a procedure for equating that can be implemented on a regular basis.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2018. p. 61
Series
Academic dissertations at the department of Educational Measurement, ISSN 1652-9650 ; 11
Keywords
national tests, validity, interrater reliability, standard setting, linking, subscores, test development
National Category
Pedagogical Work
Research subject
didactics of educational measurement
Identifiers
urn:nbn:se:umu:diva-153056 (URN)
978-91-7601-936-8 (ISBN)
Public defence
2018-11-30, KBE303, Stora hörsalen i KBC-huset, Umeå, 10:00 (English)
Available from: 2018-11-09 Created: 2018-11-05 Last updated: 2018-11-20 Bibliographically approved

Open Access in DiVA

fulltext (433 kB), 528 downloads
File information
File name: FULLTEXT01.pdf
File size: 433 kB
Checksum SHA-512:
758208da5e025491d126640961f33e8e3bc1a73f4a6a06a47c2962bba08e4ea6709efa5977603b76267afc33c1d6676161282fe28931faf064ad18eb7f46a71c
Type: fulltext
Mimetype: application/pdf

By author/editor
Lind Pantzare, Anna