Inter-Rater Agreement and AI Scoring: The Reliability Evidence You Now Need

When an AI scorer enters a credentialing pipeline, it joins the rater pool. It does not replace the rater pool. The Standards for Educational and Psychological Testing did not stop applying because the rater is now a model. The reliability evidence credentialing programmes have always needed for human scoring is the same evidence they now need for AI scoring, sometimes in more detail, and always with the same statistical discipline.

“AI is a rater, not an exception.”

This article is a practical guide for the people inside credentialing organisations who are responsible for that evidence. It walks through the inter-rater agreement methods, the bias and fairness diagnostics, the drift monitoring approaches, and the overall reliability frameworks that now apply to AI scoring as much as they do to human scoring. It is written for psychometricians, assessment design leads, and the technical leadership who need to know what their evidence pack should contain. It is longer than our other articles because the subject does not survive abbreviation.

The principle: AI is a rater, not an exception

The temptation when AI scoring arrives is to treat it as either better than humans or worse than humans, and to skip the work of measuring it the way humans are measured. Both are mistakes. The defensible position is to treat the AI scorer as a rater whose properties need to be evaluated using the same reliability framework you already apply to humans, with three adjustments.

The first adjustment is that the AI scorer is one entity, not many, so the inter-rater statistics that compare humans to each other now compare humans to the AI as well. The second is that the AI scorer can drift between versions in ways human raters do not, so monitoring needs to be more frequent and version-aware. The third is that the AI scorer’s behaviour across subgroups is the validity question that matters most, because subgroup performance issues in AI scoring have been one of the most consistent findings in the field.

Once those adjustments are made, the rest of the discipline carries over directly. The rest of this article walks through the methods, the thresholds, and the standards alignment that should sit in your evidence pack.

Inter-rater agreement: the foundation

Inter-rater agreement statistics measure the extent to which different raters arrive at the same conclusions about the same candidates. They are foundational because no scoring decision can be defended without evidence that it would have been made the same way by a different qualified rater.

For nominal or ordinal rating scales with two raters, Cohen’s kappa is the standard statistic. It measures agreement beyond what would be expected by chance. The widely used thresholds are: below 0.40 is poor, 0.41 to 0.60 is moderate, 0.61 to 0.80 is substantial, and above 0.80 is almost perfect. For credentialing purposes, anything below 0.60 on a scoring criterion is a finding that requires action: rater retraining, rubric clarification, or both. When the AI scorer is one of the two raters, the kappa between AI and human is the headline number that tells you whether the AI is making the same decisions as your trained human rater pool.
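
As a minimal sketch of the calculation, assuming hypothetical paired scores on a 1 to 5 rubric, scikit-learn's cohen_kappa_score does the work. For ordinal rubrics, the quadratic-weighted variant is worth reporting alongside the unweighted one, because it penalises a two-band miss more than a one-band miss.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal band scores for the same ten scripts (1-5 rubric).
human_scores = [3, 4, 2, 5, 3, 3, 4, 2, 5, 4]
ai_scores    = [3, 4, 3, 5, 3, 4, 4, 2, 5, 4]

# Unweighted kappa treats every disagreement equally; quadratic weighting
# penalises larger band disagreements more heavily.
kappa = cohen_kappa_score(human_scores, ai_scores)
kappa_weighted = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

print(f"Cohen's kappa: {kappa:.2f} (quadratic-weighted: {kappa_weighted:.2f})")
if kappa < 0.60:
    print("Below the 0.60 action threshold: retraining or rubric clarification needed.")
```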

For three or more raters on nominal or ordinal scales, Fleiss’ kappa extends the same logic to multiple raters simultaneously. The thresholds are the same as for Cohen’s. If your scoring panel includes the AI plus several humans, Fleiss’ kappa gives you the overall agreement statistic, and the per-pair Cohen’s kappas give you the diagnostic detail.
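
A sketch of the multi-rater case, assuming a hypothetical panel of three humans plus the AI, can lean on statsmodels:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical panel scores: rows are scripts, columns are raters
# (three humans plus the AI), cells are ordinal band assignments.
ratings = np.array([
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 2, 2, 2],
    [4, 3, 4, 4],
    [1, 2, 1, 1],
])

# aggregate_raters converts subject-by-rater labels into the
# subject-by-category count table that fleiss_kappa expects.
counts, _categories = aggregate_raters(ratings)
print(f"Fleiss' kappa across the panel: {fleiss_kappa(counts):.2f}")
```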

For continuous or interval scales, the appropriate statistic is the Intraclass Correlation Coefficient, the ICC. Multiple forms exist depending on your design, including ICC(1,1), ICC(2,1), and ICC(3,k), and the choice matters for the interpretation. The general thresholds are: below 0.50 is poor, 0.50 to 0.75 is moderate, 0.75 to 0.90 is good, and above 0.90 is excellent. For high-stakes credentialing decisions, an ICC below 0.70 is the threshold below which scores may not be dependable enough to support credential awards. The Standards expect appropriate reliability coefficients to be reported, and the ICC is what answers that expectation for continuous scoring.
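
The pingouin library reports all six ICC forms from long-format data, which keeps the design choice explicit. A sketch with hypothetical candidate, rater, and score columns:

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per (candidate, rater) pair.
df = pd.DataFrame({
    "candidate": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":     ["human_a", "human_b", "ai"] * 4,
    "score":     [72, 75, 74, 58, 55, 57, 90, 88, 91, 64, 66, 63],
})

# pingouin returns all six forms; select the row that matches your design,
# e.g. ICC2 (two-way random, single rater) when raters are sampled from a pool.
icc = pg.intraclass_corr(data=df, targets="candidate", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```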

Agreement thresholds at a glance

Cohen’s and Fleiss’ kappa (nominal/ordinal)

  • below 0.40 — poor
  • 0.41 to 0.60 — moderate (action threshold for credentialing)
  • 0.61 to 0.80 — substantial
  • above 0.80 — almost perfect

Intraclass Correlation Coefficient (continuous)

  • below 0.50 — poor
  • 0.50 to 0.75 — moderate
  • 0.75 to 0.90 — good
  • above 0.90 — excellent

For high-stakes credential decisions, an ICC of 0.70 is the minimum.

These statistics align directly with Standard 2.3 of the AERA, APA, and NCME framework, which expects reliability evidence, including inter-rater consistency, to be reported. They also align with NCCA Standard 17 for personnel certification and ISO/IEC 17024 clause 9.3 for examination bodies. The standards are explicit. When AI is in the scoring pipeline, the standards still apply.

Rater bias: leniency, severity, and the AI as a rater

Inter-rater agreement tells you whether raters are converging on the same scores. Rater bias diagnostics tell you whether they are converging in the right place, or whether systematic patterns of leniency or severity are affecting the scores.

For overall leniency or harshness, the simple diagnostic is the mean score difference between raters, evaluated with a paired t-test or ANOVA. The working rule of thumb is to flag any rater whose mean deviates by more than 0.5 standard deviations from the grand mean across raters. This is the diagnostic that catches a rater who is systematically scoring half a band high or low. When the AI is one of the raters, this is the test that catches an AI scorer that is systematically generous or systematically harsh relative to the human consensus.
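
For the two-rater case, a minimal sketch of the check, using hypothetical paired scores and the pooled standard deviation as the yardstick:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical scores from one human rater and the AI on the same eight scripts.
human = np.array([62, 70, 55, 81, 74, 66, 59, 77])
ai    = np.array([66, 76, 60, 86, 78, 72, 64, 82])

t_stat, p_value = ttest_rel(ai, human)
mean_diff = (ai - human).mean()
pooled_sd = np.concatenate([ai, human]).std(ddof=1)

print(f"Mean difference: {mean_diff:+.1f} points (t={t_stat:.2f}, p={p_value:.3f})")
if abs(mean_diff) > 0.5 * pooled_sd:
    print("Deviation exceeds 0.5 pooled SDs: flag for leniency/severity review.")
```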

For more sophisticated analysis, particularly in workplace-based and performance assessments, Many-Facet Rasch Measurement, MFRM, is the established method. Often run through software such as Facets, MFRM simultaneously estimates candidate ability, item difficulty, rater severity, and other facets of the assessment design. A rater severity logit range above 1.0 is a problematic spread, indicating that the choice of rater materially affects the candidate’s score. Infit and outfit mean square statistics in the 0.5 to 1.5 range indicate acceptable rater consistency. Outside that range, the rater is either too predictable or too unpredictable to be reliable.

MFRM is especially useful when the AI scorer is added to a human panel, because it produces a comparable severity estimate for the AI alongside the humans. If the AI sits within the human severity distribution, that is supportive evidence. If it sits at the extreme, the validity argument needs to address why and how the cut score is still defensible. Standard 2.16 of the Testing Standards explicitly expects rater effects, including severity and leniency, to be estimated. MFRM is the most rigorous way to meet that expectation.

Demographic bias and DIF analysis

Inter-rater agreement and severity diagnostics tell you about consistency. They do not directly tell you about fairness. For fairness, you need differential item functioning analysis, DIF, applied to the scoring criteria as well as to the items.

DIF analysis examines whether candidates of equal underlying ability receive systematically different scores depending on their group membership. The two standard methods are the Mantel-Haenszel procedure and logistic regression. The Mantel-Haenszel delta classification gives you A for negligible DIF (absolute delta below 1.0), B for moderate (1.0 to 1.5), and C for large (above 1.5). Category C findings on any scoring criterion are a flag that requires investigation.
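
A sketch of the Mantel-Haenszel calculation, using statsmodels' StratifiedTable on hypothetical 2x2 tables stratified by total-score band; the ETS delta metric is delta = -2.35 ln(alpha), where alpha is the pooled odds ratio:

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Hypothetical 2x2 tables for one scoring criterion, one per ability stratum.
# Rows: focal group, reference group; columns: criterion met, criterion not met.
tables = [
    np.array([[18, 12], [25, 9]]),   # low total-score band
    np.array([[30, 10], [38, 8]]),   # middle band
    np.array([[22, 4],  [28, 3]]),   # high band
]

st = StratifiedTable(tables)
alpha_mh = st.oddsratio_pooled            # common odds ratio across strata
delta_mh = -2.35 * np.log(alpha_mh)       # ETS delta metric

category = "A" if abs(delta_mh) < 1.0 else "B" if abs(delta_mh) < 1.5 else "C"
print(f"MH odds ratio {alpha_mh:.2f}, delta {delta_mh:+.2f}, category {category}")
```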

For credentialing programmes, the subgroups that need explicit analysis include gender, ethnicity and race, age band, disability status and accommodation use, and assessment centre or location. Each of these is a subgroup where unequal treatment by AI scoring has been documented in the field, and each is a subgroup that regulatory and legal frameworks expect you to evaluate. The Standards in question include Standard 3.1 on fairness, Standard 3.6 on differential functioning, Standard 3.9 on modified assessments measuring the same construct, and Standard 3.13 on fair use. NCCA Standard 11 covers fairness and absence of bias. The Equality Act 2010 in the UK and Title VII of the Civil Rights Act in the US both attach legal obligations to group-level fairness in credentialing.

For AI scoring specifically, the DIF analysis needs to be run not just at the assessment level but at the criterion level. AI scorers can be unbiased on overall scores while still showing systematic differences on individual rubric criteria, and those differences can affect cut score outcomes for candidates who are close to the threshold. The full validity argument needs DIF results at both levels and a documented response to any category B or C findings.

Score spread and range restriction

A subtler problem in scoring quality is range restriction. A rater who only ever uses the middle of the scale, awarding 3s and 4s on a 1 to 5 rubric, is not differentiating candidates effectively even if their agreement statistics look acceptable. Similarly, a rater who exaggerates the extremes is introducing variance that does not reflect candidate ability.

The diagnostics are straightforward. For each rater, calculate the standard deviation of their scores, the range from minimum to maximum, and the kurtosis. A rater whose standard deviation is below 50 percent of the pooled standard deviation is compressed. A rater whose standard deviation is above 150 percent of the pooled standard deviation is exaggerated. A leptokurtic distribution suggests the rater is clustering scores around a preferred value rather than using the full scale.
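
A minimal sketch of these diagnostics, with hypothetical AI scores and an assumed pooled standard deviation:

```python
import numpy as np
from scipy.stats import kurtosis

def range_diagnostics(scores, pooled_sd):
    """Flag compression, exaggeration, and clustering for one rater."""
    sd = np.std(scores, ddof=1)
    flags = []
    if sd < 0.5 * pooled_sd:
        flags.append("compressed: SD below 50% of pooled")
    if sd > 1.5 * pooled_sd:
        flags.append("exaggerated: SD above 150% of pooled")
    if kurtosis(scores) > 0:   # positive excess kurtosis is leptokurtic
        flags.append("leptokurtic: clustering around a preferred value")
    return sd, np.ptp(scores), flags

# Hypothetical AI scores on a 1-5 rubric that never leave the middle bands.
ai_scores = np.array([3, 3, 4, 3, 4, 3, 3, 4, 3, 3])
print(range_diagnostics(ai_scores, pooled_sd=1.1))
```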

For AI scorers, range restriction is a common pattern. Models that have been fine-tuned to minimise extreme errors can end up regressing toward the mean, awarding scores that look safe but that fail to differentiate strong candidates from weak ones. This is the kind of finding that is invisible until you check, and it is central to validity for credentialing decisions where the cut score sits away from the median. Standard 2.16 expects rater effects to be estimated and reported, and Standard 5.7 covers score distribution properties. NCCA Standard 17 and Ofqual Condition G4 both reinforce the expectation in their respective contexts.

Rater tracking and consensus alignment

Inter-rater agreement tells you about pairwise agreement. Rater tracking tells you whether each rater is correctly rank-ordering candidates relative to the consensus.

The standard approach is to compute the Pearson or Spearman correlation between each rater’s scores and the mean or median across all raters. A correlation below 0.50 is poor tracking, indicating that the rater’s view of which candidates are stronger does not match the panel’s view. Between 0.50 and 0.70 is moderate. Above 0.70 is good. For AI scorers, this correlation is the headline number that tells you whether the AI’s ranking of candidates matches the human consensus ranking. Even when the absolute scores differ, if the rank ordering is close, the AI is capturing the same construct. When the rank ordering diverges, the AI is measuring something different from what the humans are measuring, and the validity argument has to address that gap before the AI can be used operationally.
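
A sketch of the tracking check on hypothetical panel data follows. One caveat: including a rater's own scores in the consensus inflates the correlation slightly, so a stricter version computes a leave-one-out consensus for each rater.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical scores: rows are raters (three humans plus the AI),
# columns are candidates.
panel = np.array([
    [62, 70, 55, 81, 74, 66],   # human_a
    [60, 72, 58, 79, 71, 68],   # human_b
    [65, 69, 52, 83, 76, 64],   # human_c
    [61, 71, 57, 80, 73, 67],   # ai
])

consensus = np.median(panel, axis=0)
for name, scores in zip(["human_a", "human_b", "human_c", "ai"], panel):
    rho, _ = spearmanr(scores, consensus)
    verdict = "good" if rho > 0.70 else "moderate" if rho >= 0.50 else "poor"
    print(f"{name}: rho = {rho:.2f} ({verdict} tracking)")
```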

Rater drift over time

Human raters drift. They become more lenient or harsher over the course of a long marking session, after a holiday, or in response to fatigue. AI raters drift differently. They drift when their underlying model is updated, when training data is refreshed, or when retraining changes the system’s behaviour.

The diagnostics for both kinds of drift are similar. Statistical process control charts, including X-bar and CUSUM charts, can detect gradual changes in mean scores or variance. Segmented regression can identify breakpoints in the data where rater behaviour changed. Moving-window kappa or ICC, calculated over rolling sets of scripts, can show drift in agreement statistics over time. The action thresholds are: control chart points outside plus or minus two standard deviations are a warning, points outside plus or minus three are an action signal, and CUSUM persistence signals are a finding worth investigating immediately.
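
The moving-window kappa is the simplest of these to implement. A sketch over a hypothetical stream of scripts in marking order:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def moving_window_kappa(human, ai, window=50, step=25):
    """AI-human kappa over rolling batches of scripts, in scoring order."""
    return [
        (start, cohen_kappa_score(human[start:start + window],
                                  ai[start:start + window]))
        for start in range(0, len(human) - window + 1, step)
    ]

# Hypothetical score streams in the order the scripts were marked.
rng = np.random.default_rng(0)
human = rng.integers(1, 6, size=300)
ai = np.clip(human + rng.integers(-1, 2, size=300), 1, 5)

window = 50
for start, kappa in moving_window_kappa(human, ai, window=window):
    print(f"scripts {start}-{start + window - 1}: kappa = {kappa:.2f}")
```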

For AI scoring, the additional discipline is to anchor drift monitoring to model versions. Every change to the underlying model, the prompt, the threshold, or the post-processing pipeline needs to be logged and the agreement statistics recalculated for the new version. If the new version produces statistically different scores from the previous version on a held-out validation set, the change is material and needs governance approval before it goes into operational use. Standard 2.16 expects rater behaviour to be monitored over time, and Standard 5.8 addresses extended scoring periods. NCCA Standard 17 reinforces the ongoing monitoring expectation, as does Ofqual Condition G4 in the UK. The companion piece on vendor governance sets out the contractual side of catching these changes before they happen.
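
The version comparison itself can be as simple as a paired test on the fixed validation set, as in this hypothetical sketch:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical scores from two model versions on the same held-out validation set.
v1 = np.array([62, 70, 55, 81, 74, 66, 59, 77, 68, 72])
v2 = np.array([64, 73, 55, 84, 77, 69, 60, 80, 70, 75])

t_stat, p_value = ttest_rel(v2, v1)
print(f"Mean shift between versions: {(v2 - v1).mean():+.1f} points (p = {p_value:.3f})")
if p_value < 0.05:
    print("Material change: hold the new version for governance approval.")
```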

Overall reliability: generalizability theory and SEM

Inter-rater statistics are powerful but they are not the whole reliability story. For credentialing programmes that take reliability seriously, generalizability theory provides the most comprehensive framework. A G-study partitions total score variance into components for the candidate, the rater, the task or component, and the residual; the G-coefficient then summarises the proportion of that variance attributable to true differences between candidates. A G-coefficient of 0.80 or higher is acceptable for most decisions. For high-stakes pass or fail decisions, 0.90 or higher is desirable.

The companion analysis is the D-study, which uses the variance components to determine whether reliability would be improved more by adding raters, adding tasks, or both. For credentialing programmes considering whether to use AI scoring as a second rater alongside humans, the D-study is the analysis that answers the question whether the AI improves reliability enough to justify its inclusion. Standard 2.3 expects reliability appropriate to the design to be reported, and Standard 2.4 covers cases where multiple sources of variance contribute. ACGME and the General Medical Council both recommend G-theory for workplace-based assessments where AI scoring is increasingly being deployed.
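
A minimal D-study sketch, assuming illustrative variance components from a prior G-study (the numbers here are placeholders, not benchmarks), shows how projected reliability changes as raters are added:

```python
# Illustrative variance components, assumed from a prior G-study.
var_candidate = 0.60   # true differences between candidates
var_rater     = 0.08   # rater main effect (severity spread)
var_residual  = 0.32   # candidate-by-rater interaction and error

def g_relative(n_raters):
    # Relative G-coefficient: candidate variance over candidate variance
    # plus interaction/error variance averaged across raters.
    return var_candidate / (var_candidate + var_residual / n_raters)

def g_absolute(n_raters):
    # Absolute (phi) coefficient also charges rater severity spread as error.
    return var_candidate / (var_candidate + (var_rater + var_residual) / n_raters)

for n in (1, 2, 3, 4):
    print(f"{n} rater(s): G = {g_relative(n):.2f}, phi = {g_absolute(n):.2f}")
```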

Alongside G-theory, the Standard Error of Measurement, SEM, provides a single-number summary that supports decision interpretation. SEM is calculated as the standard deviation multiplied by the square root of one minus the reliability. The 95 percent confidence interval around any score is the score plus or minus 1.96 times the SEM. For credentialing decisions, the critical question is whether the confidence interval around a candidate’s score crosses the cut score. If it does, the decision is statistically marginal and deserves additional review, regardless of whether the score was generated by a human, an AI, or a combination. Standard 2.5 expects SEM to be reported, and Standard 2.14 addresses score comparability. Reporting SEM and confidence intervals in your governance pack is one of the most defensible things you can do for AI-influenced decisions, because it shifts the conversation from “did the AI get it right” to “how confident can we be in any decision at this score level”.
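
The calculation is simple enough to sketch directly, with hypothetical reliability and cut score values:

```python
import math

def sem_interval(score, sd, reliability, cut_score):
    """95% confidence interval around a score, flagged against the cut score."""
    sem = sd * math.sqrt(1 - reliability)
    low, high = score - 1.96 * sem, score + 1.96 * sem
    return sem, (low, high), low <= cut_score <= high

# Hypothetical: score SD of 10, reliability of 0.85, cut score of 65.
sem, ci, marginal = sem_interval(score=68, sd=10, reliability=0.85, cut_score=65)
print(f"SEM = {sem:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f}), marginal = {marginal}")
```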

Bringing it together: what your evidence pack should contain

For each scoring component where AI plays a role, your evidence pack should contain the following, refreshed at a defined cadence and tied to the model version operating at the time of generation.

The AI scoring evidence pack

  • Inter-rater agreement statistics between AI and human raters, with the appropriate kappa or ICC for the scale type
  • Rater bias diagnostics for the AI scorer, including mean comparison and where feasible MFRM severity estimates
  • DIF analysis at both assessment and criterion level, broken down by the subgroups your candidate population requires
  • Range and kurtosis diagnostics for the AI scorer to detect compression or exaggeration
  • Consensus correlation showing how the AI’s ranking aligns with the human rater pool
  • Drift monitoring across model versions, with statistical process control for live operation
  • Generalizability analysis where the design supports it, with the G-coefficient and a D-study informing whether AI inclusion improves reliability
  • SEM and confidence interval reporting for cut score decisions

This is a substantial body of evidence. It is also the body of evidence that defends a credential against legal, regulatory, and reputational challenges. The Standards have always asked for it. AI does not change the obligation. It changes the work.

What this enables

“The same diagnostics that catch AI problems catch human rater problems.”

Credentialing organisations that build this evidence pack are not doing extra work for its own sake. They are doing the work that lets them deploy AI scoring with confidence, defend it against challenges, and improve it over time on the basis of data rather than guesswork. They are also building the foundation for a stronger credential, because the same diagnostics that catch AI problems catch human rater problems, and the same monitoring discipline that protects against AI drift protects against rater drift across cycles.

The reliability evidence is not separate from the AI governance work. It is the part of AI governance that turns abstract statements about validity and fairness into specific numbers that auditors, regulators, and stakeholders can verify. In a credentialing market that is moving from “tell me you are fair” to “show me your evidence”, the organisations with the numbers will have the advantage. Sister pieces in this series cover the construct decision, the AI register, and the Testing Standards framing that this evidence pack supports.

If you are building your AI evidence pack and need a starting point, the methods, thresholds, and standards alignments in this article are the working baseline. They are not theoretical. They are what credible credentialing organisations are using right now to keep their AI use defensible.

Ready to build the reliability evidence pack your AI scoring needs?

Talk to our team about how Globebyte can help you put real numbers behind your validity argument.
