Compliance & Standards

AI and the Future of Testing Standards

If you have spent any time in credentialing assessment, you know the AERA, APA, and NCME Standards for Educational and Psychological Testing. They are the professional benchmark for responsible testing, jointly published by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. They are not regulation. They are something more durable than that. They are what the field expects of itself.

AI changes how assessment programmes operate, but it does not change what the Standards expect. It raises the evidence burden. This article explains where the bar moves when AI enters scoring, proctoring, or item development, and what credentialing leaders should be doing now to keep their evidence pack ahead of their technology.

“AI does not change the question. It changes the evidence you need to answer it.”

The Standards, and why they still matter

The Standards cover validity, reliability and precision, fairness, administration, scoring, reporting, and security across the assessment lifecycle. They are guidance, not enforceable regulation, but the field treats adherence to them as a professional responsibility.

For credentialing programmes, they are the document you are measured against in legal challenges, accreditation reviews, and senior stakeholder scrutiny. When a candidate’s lawyer asks how you know the test was fair, the answer that lands is one that ties your evidence back to the Standards. When an employer asks why they should trust your credential, the same is true.

The Standards explicitly address licensure and certification. They expect a defined and justified credentialing test domain, typically grounded in job or practice analysis. They expect evidence about decision consistency, especially around the cut score. They expect transparency about how scores are combined. And they are explicit that passing standards should be tied to credential-worthy performance, not adjusted to manage pass rates.

Every one of these expectations becomes harder to meet when AI is added to the pipeline without a corresponding update to the evidence base.

Validity, fairness, and reliability when AI is in the loop

The core question the Standards ask is whether the evidence supports the intended interpretation of test scores and the decisions that follow from them. AI does not change that question. It changes the evidence you need to answer it.

When AI influences scoring, your validity argument now includes the AI system itself. You need to be able to explain what it does, why its outputs are appropriate for the construct, and how its behaviour holds up across the population the credential serves.

When AI influences proctoring, your fairness argument now includes the false positive and false negative rates of the AI flags, broken down by subgroup. When AI accelerates item development, your content validity argument now includes the controls that ensure generated items meet the same domain alignment and quality standards as human-written items.

Most credentialing programmes have not updated their validity and fairness evidence to reflect these changes. The original evidence was assembled before AI entered the workflow. The technology has moved. The evidence base has not. That gap is the defensibility risk.

Hot zone one: AI scoring and construct-irrelevant variance

Automated scoring is the area where AI has the longest history in assessment, and where the Standards offer the clearest guidance. The central concern is construct-irrelevant variance. If a scoring model rewards features of a response that are not part of the construct being measured, the score is contaminated by something other than candidate competence.

The Standards call for three things when automated scoring is in use:

What the Standards expect of automated scoring

  • evaluation of validity for the relevant subgroups in the candidate population
  • review of algorithms for potential bias
  • documentation of the response characteristics at each score level and the theoretical and empirical basis for the scoring model, before operational use

For modern AI scoring, particularly anything based on large language models or neural architectures, all three are harder to do well than they were for earlier rule-based scorers. The model is less interpretable. The training data is less transparent. The behaviour can drift between model versions in ways that are not obvious until subgroup analysis catches them.

The implication for credentialing is straightforward. If you use AI in scoring, your evidence pack needs subgroup validity data, a documented theoretical basis for the model, and a monitoring plan that catches drift between versions. If your vendor cannot provide the underlying evidence, you need to be running enough independent monitoring to fill the gap.
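
What that monitoring can look like is easy to sketch. The snippet below is a minimal illustration in Python, not a production monitor: the record fields, the 0.25 threshold, and the mean-shift check are all assumptions for the example. A real programme would apply its own psychometric criteria, such as agreement with human raters, alongside a simple shift check.

```python
from collections import defaultdict
from statistics import mean

def drift_by_subgroup(records, threshold=0.25):
    """Compare two scorer versions on the same responses, per subgroup.

    Each record is a dict with illustrative keys:
      'subgroup'  - a demographic or language group label
      'score_v1'  - score from the current model version
      'score_v2'  - score from the candidate replacement version
    Returns subgroups whose mean score shift exceeds `threshold`.
    """
    shifts = defaultdict(list)
    for r in records:
        shifts[r["subgroup"]].append(r["score_v2"] - r["score_v1"])

    flagged = {}
    for group, deltas in shifts.items():
        avg_shift = mean(deltas)
        if abs(avg_shift) > threshold:
            flagged[group] = round(avg_shift, 3)
    return flagged

# Toy data: the new version drifts upward for one subgroup only.
sample = [
    {"subgroup": "L1-English", "score_v1": 4.0, "score_v2": 4.1},
    {"subgroup": "L1-English", "score_v1": 3.0, "score_v2": 3.0},
    {"subgroup": "L2-English", "score_v1": 4.0, "score_v2": 4.6},
    {"subgroup": "L2-English", "score_v1": 3.0, "score_v2": 3.5},
]
print(drift_by_subgroup(sample))  # {'L2-English': 0.55}
```

A check this simple, run on every model version change, is exactly the kind of evidence an appeals panel or accreditation reviewer expects to see.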

Hot zone two: AI proctoring and false positive risk

AI proctoring and integrity tools generate flags. Each flag is a probabilistic judgment that something unusual happened in a test session. None of those judgments are individually conclusive. The Standards place strong responsibility on test users to protect both security and candidate rights, and that responsibility extends to the way flags are reviewed and acted upon.

“The risk is not the flag itself. It is treating a flag as a decision.”

The risk is not the flag itself. It is treating a flag as a decision. False positives are a fairness problem because they create a real burden on innocent candidates, and that burden often falls unevenly across subgroups. They are also a defensibility problem because they tend to surface in appeals, where the question becomes whether the credential owner had a sound process for converting a flag into an outcome.

The expected pattern is straightforward. Every flag is reviewed by a human with relevant training. Every adverse outcome has a documented rationale that does not rely solely on the flag. Every candidate has a route to challenge the decision. Subgroup outcomes are monitored to detect systematic bias in flag generation or in human review. None of this is new. What is new is the volume of flags and the ease with which they can be treated as authoritative. The companion piece on design-led integrity from Ofqual and JCQ sets out the operating pattern that closes the door on detector-only enforcement.
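
The subgroup monitoring step can be sketched just as minimally. The snippet below assumes a simple session log and an illustrative 1.5x disparity ratio; a real programme would set its own thresholds and feed disparate results to human review of the flagging pipeline rather than acting on them directly.

```python
from collections import Counter

def flag_rate_report(sessions, ratio_limit=1.5):
    """Summarise proctoring flag rates by subgroup.

    Each session is a (subgroup, was_flagged) tuple; the tuple shape
    and the 1.5x ratio are illustrative assumptions, not a standard.
    A 'disparate' result is a prompt for human review of the flagging
    pipeline, never an automatic judgment about any candidate.
    """
    totals, flags = Counter(), Counter()
    for group, was_flagged in sessions:
        totals[group] += 1
        flags[group] += int(was_flagged)

    overall = sum(flags.values()) / sum(totals.values())
    report = {}
    for group in totals:
        rate = flags[group] / totals[group]
        report[group] = {"rate": round(rate, 3),
                         "disparate": rate > ratio_limit * overall}
    return report

# Toy data: subgroup B is flagged at four times A's rate.
sessions = [("A", False)] * 95 + [("A", True)] * 5 \
         + [("B", False)] * 80 + [("B", True)] * 20
print(flag_rate_report(sessions))
# {'A': {'rate': 0.05, 'disparate': False},
#  'B': {'rate': 0.2, 'disparate': True}}
```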

Hot zone three: AI-assisted item development

Generative AI accelerates item creation. That is genuinely valuable. It is also a governance risk if the speed of generation outpaces the discipline of review.

The Standards expect items to be aligned to the test domain, free of bias, and validated through pilot testing and analysis. Those expectations do not relax when an AI tool drafts the items. They become more important, because the volume increases and the human eye on each item decreases.

Credentialing programmes using AI for item development should be running:

  • bias review against the same subgroup considerations as human-authored items
  • domain alignment checks against the test specifications
  • security controls to prevent model-mediated leakage of sensitive content
  • pilot testing and statistical review before items enter operational forms

The point is not to slow item development. It is to make sure the speed gain is real, by catching the items that need rework before they reach a candidate.
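
As a structural illustration, here is a minimal gating sketch in Python. The gate functions are stubs standing in for the human reviews and statistical analyses in the list above; the point of the sketch is the shape of the control, where an AI-drafted item cannot reach an operational form until every check has passed.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DraftItem:
    item_id: str
    stem: str
    passed: list[str] = field(default_factory=list)

# Placeholder gates. In practice each is a review panel or a
# statistical analysis, not a line of code; they are stubs here
# only to show the gating structure.
def bias_review(item: DraftItem) -> bool: return True
def domain_alignment(item: DraftItem) -> bool: return True
def security_screen(item: DraftItem) -> bool: return True
def pilot_statistics(item: DraftItem) -> bool: return False  # not yet piloted

GATES: list[tuple[str, Callable[[DraftItem], bool]]] = [
    ("bias_review", bias_review),
    ("domain_alignment", domain_alignment),
    ("security_screen", security_screen),
    ("pilot_statistics", pilot_statistics),
]

def is_operational(item: DraftItem) -> bool:
    """An item enters operational forms only if every gate passes;
    failures route the item to rework rather than discarding it."""
    for name, gate in GATES:
        if not gate(item):
            return False
        item.passed.append(name)
    return True

item = DraftItem("Q-001", "Which control mitigates ...?")
print(is_operational(item), item.passed)
# False ['bias_review', 'domain_alignment', 'security_screen']
```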

Cut score defensibility under AI conditions

Cut scores are where credentialing decisions concentrate. The Standards are explicit that passing standards must be tied to credential-worthy performance, not adjusted to manage pass rates. That principle holds whether scoring is done by humans, by AI, or by a combination.

When AI influences how scores are generated or combined, you need evidence that decision consistency holds across the candidate population. If the AI scorer behaves differently on responses from different subgroups, the cut score that was defensible under human scoring may no longer be defensible. This is the kind of finding that emerges from monitoring, not from initial validation. Credentialing programmes need a monitoring rhythm that surfaces these issues before they become public.
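
A decision consistency check can also be sketched briefly. The snippet below assumes paired human and AI scores and an illustrative cut score of 70; a real analysis would also report agreement statistics such as kappa and would concentrate on candidates scoring near the cut, where flipped decisions do the most damage.

```python
from collections import defaultdict

CUT_SCORE = 70  # illustrative passing standard, not a recommendation

def decision_agreement(records):
    """Rate of agreement between human and AI pass/fail decisions,
    per subgroup. The record keys are illustrative assumptions."""
    agree, total = defaultdict(int), defaultdict(int)
    for r in records:
        human_pass = r["human_score"] >= CUT_SCORE
        ai_pass = r["ai_score"] >= CUT_SCORE
        total[r["subgroup"]] += 1
        agree[r["subgroup"]] += int(human_pass == ai_pass)
    return {g: round(agree[g] / total[g], 3) for g in total}

records = [
    {"subgroup": "A", "human_score": 72, "ai_score": 74},
    {"subgroup": "A", "human_score": 68, "ai_score": 66},
    {"subgroup": "B", "human_score": 71, "ai_score": 67},  # flipped decision
    {"subgroup": "B", "human_score": 75, "ai_score": 76},
]
print(decision_agreement(records))  # {'A': 1.0, 'B': 0.5}
```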

The Standards are being revised

The Standards are actively under revision. Leadership and committee structures have been established to update the guidance for the AI era. Expect sharper direction on technology in testing, and expect that direction to raise rather than lower the evidence burden.

“There is no advantage in waiting for the new edition before acting.”

The credentialing organisations that benefit from the revision will be the ones already operating to a high evidence standard. The ones that are not will face a faster catch-up than they were planning for. There is no advantage in waiting for the new edition before acting.

Three actions to close the evidence gap

Three things move the needle in the next quarter.

First, build your AI register and tie every AI use to validity, fairness, security, and decision impact. The Standards give you the lens. The register gives you the inventory to apply it. The companion piece on ISO 42001 and 23894 walks through the operating model that produces the register.
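
For illustration only, one register entry might be structured like this. The four impact fields mirror the lenses named above; the other fields are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class AIRegisterEntry:
    """One entry in the AI register; field names beyond the four
    lenses are illustrative assumptions."""
    use_case: str            # e.g. "essay scoring", "proctoring flags"
    vendor_or_internal: str
    validity_impact: str     # how this use touches the validity argument
    fairness_impact: str     # subgroup evidence required
    security_impact: str     # leakage and exposure considerations
    decision_impact: str     # which credential decisions it can influence
    evidence_refs: tuple     # pointers into the evidence pack

entry = AIRegisterEntry(
    use_case="automated essay scoring",
    vendor_or_internal="vendor",
    validity_impact="scores feed the written component directly",
    fairness_impact="subgroup validity data required per release",
    security_impact="responses leave the programme's environment",
    decision_impact="contributes to pass/fail at the cut score",
    evidence_refs=("EV-014", "EV-022"),
)
print(entry.use_case, "->", entry.decision_impact)
```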

Second, update your evidence pack. If AI is involved in scoring, add subgroup validity evidence and a documented model basis. If AI influences credential decisions, add decision consistency data. If you use AI proctoring, document the human review process and the appeals route. None of this is research. It is housekeeping that makes your existing programme defensible under current conditions.

Third, push the same expectations into your vendors and partners. Your contractual position should require the evidence you need for your own validity argument. If a supplier cannot provide subgroup data, model documentation, or change notices, that is a sourcing issue, not an operational one.

Three actions this quarter

  • Build the AI register and tie every use to validity, fairness, security, and decision impact.
  • Update the evidence pack with subgroup validity data, decision consistency data, and human review documentation.
  • Push expectations into vendors so contractual terms supply the evidence your validity argument depends on.

The bar is rising. Move with it.

AI does not replace the AERA, APA, and NCME Standards. It raises the evidence burden in exactly the places that matter most for credentialing: scoring, proctoring, and the decisions that flow from them. The Standards have always asked credentialing bodies to demonstrate validity, fairness, and defensibility through evidence. AI changes the evidence you need, not the obligation to produce it.

The credentialing organisations that align their AI uses to the Standards now will find themselves with a defensible evidence pack when stakeholders ask. The ones that wait will find themselves explaining a gap. In a field where trust is the product, the difference is decisive. The companion article on the EU AI Act covers the regulatory side of the same picture.

Ready to update your validity and fairness evidence for AI in credentialing?

Talk to our team about how Globebyte can help you align your AI use to the Testing Standards.

Explore our services
