One of the most-cited papers in industrial-organisational psychology is Schmidt and Hunter’s 1998 meta-analysis on the validity and utility of selection methods. It summarised 85 years of cumulative research on what actually predicts job performance, across hundreds of thousands of workers and 19 distinct selection methods. The finding most people remember, and the one that has shaped almost every defensible hiring process since, is that structured interviews are substantially more predictive than unstructured ones, by some estimates roughly twice as predictive. The corrected validity coefficient is around 0.51 for structured interviews and somewhere between 0.20 and 0.38 for unstructured ones, depending on the estimate and the specific protocol.
This finding has survived multiple replications, refinements, and several waves of methodological scrutiny. A 2022 re-analysis by Sackett, Zhang, Berry and Lievens revised some of the original estimates downward but kept the core conclusion intact: structure improves predictive validity, and the improvement is large. We are working from genuinely settled science here, not contested research.
And yet most companies still run mostly unstructured interviews. The mismatch between the empirical evidence and the practice is one of the most stable findings in talent acquisition: hiring managers prefer unstructured interviews even when shown the data, because the experience of an unstructured interview feels more revealing than a structured one. The feeling is uncorrelated with the prediction.
This piece is about what the structured-interview literature actually says, how it changes when an AI conducts the interview rather than a human, and what the legal and ethical implications are for hiring teams running AI-mediated selection in the EU and UK in 2026.
What structure actually means
“Structured interview” is a loose term in everyday hiring discourse. In the IO psychology literature it has a specific meaning, and the components matter.
- Job analysis. The questions are derived from a systematic analysis of what the job requires, not from what the interviewer happens to think is important.
- Standardised question content. Every candidate is asked the same questions, or questions sampled from the same construct-equivalent pool.
- Behaviourally-anchored rating scales. Scoring uses an explicit rubric where each score level is anchored to a specific behavioural example, not a vague descriptor (“great” vs “good” vs “OK”).
- Multiple raters. More than one person scores each candidate, and the ratings are combined in a defined way.
- Note-taking and evidence-grounding. Scores are tied to specific things the candidate said, not to overall impressions.
- No reliance on idiosyncratic “fit” judgments. Structure deliberately suppresses the gut-feel signal that drives most unstructured-interview decisions.
Each of these properties individually contributes to predictive validity. Together they roughly double it. The single largest contributor is the behaviourally-anchored rating scale, because it constrains the rater’s judgment to a defined frame and reduces the “halo effect” where one strong answer biases scoring on unrelated dimensions.
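To make two of those components concrete, here is a minimal sketch of a behaviourally-anchored scale combined with a defined multi-rater rule. The dimension name, the anchor wording, and the averaging rule are all illustrative assumptions, not taken from any published rubric.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Dimension:
    """One scored dimension with a behavioural anchor per score level."""
    name: str
    anchors: dict[int, str]  # score level -> observable behaviour

    def validate(self, score: int) -> int:
        # Raters may only pick levels that have a written anchor.
        if score not in self.anchors:
            raise ValueError(f"{score} is not an anchored level for {self.name}")
        return score


# Hypothetical dimension and anchors, for illustration only.
OWNERSHIP = Dimension(
    name="ownership",
    anchors={
        1: "Describes the problem as someone else's responsibility.",
        3: "Describes fixing the problem once it was assigned to them.",
        5: "Describes spotting the problem unprompted and seeing the fix through.",
    },
)


def combine_ratings(dimension: Dimension, ratings: list[int]) -> float:
    """Combine independent ratings with one defined rule (here, the mean)."""
    if len(ratings) < 2:
        raise ValueError("structured scoring needs at least two raters")
    return mean(dimension.validate(r) for r in ratings)


print(combine_ratings(OWNERSHIP, [3, 5]))
```

The point of the `validate` step is that a rater cannot record a level with no written behavioural anchor, which is what separates an anchored scale from a bare 1-to-5 rating.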
Why humans struggle with structure
The reason unstructured interviewing persists despite the evidence is partly cognitive and partly cultural.
Cognitively, structured interviews are harder to run well. The interviewer has to hold a rubric in working memory, ask follow-ups within a defined frame, take notes without breaking conversational flow, and resist the temptation to chase tangents. Humans are bad at this. Even trained interviewers drift toward unstructured conversation within five minutes if they are not actively prompted to stay in frame.
Culturally, structured interviews feel impersonal. They feel like an assessment, not a conversation. Candidates sometimes describe them as cold. Interviewers sometimes describe them as constraining. The pressure to “get to know the candidate” pulls toward less structure, even though less structure is empirically worse.
The result is that even at companies with formal structured-interview programmes, the actual interview behaviour drifts toward unstructured. Studies that audit recorded interviews find that nominally structured panels often deviate from the rubric within the first ten minutes. The structure is on paper; the practice is not.
Why AI interviewers can hold structure
An AI interviewer doesn’t drift. The structure is hard-coded into the system prompt; the rubric is hard-coded into the scoring function. Every candidate gets the same construct-equivalent questions, asked in the same order, scored against the same anchors, by the same scoring agent. The drift problem disappears.
This is a genuinely large structural advantage, and it is the single strongest defensible claim for AI-mediated interviewing. Whatever else is true about AI interviews, they hold structure better than human panels in production hiring environments. The expected effect, and the one we see in our pilot data, is predictive validity at the high end of what the structured-interview literature reports, because structure is held more tightly than human panels typically manage in practice.
The Construct-Templated Adaptive Interviewing pattern we use at Basanite (and that we covered in the CV-grounded interviews piece) is a refinement of this: it holds the construct constant while varying the surface stimulus per candidate. The rubric is identical for every candidate; the question wording is tailored to their CV. This is the design that simultaneously preserves structure and resists the rehearsal-vs-spontaneity collapse caused by leaked question banks.
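The construct-versus-stimulus split can be sketched in a few lines. This is a simplified illustration under our own assumptions, not the production implementation: the `Construct` class, the template wording, and the `rubric_id` field are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construct:
    """What is being measured; identical for every candidate."""
    name: str
    template: str   # fixed probe, only the {project} slot varies
    rubric_id: str  # every candidate is scored against the same rubric


def render_question(construct: Construct, cv_project: str) -> str:
    """Fill the fixed probe with a project taken from the candidate's CV."""
    return construct.template.format(project=cv_project)


# Hypothetical construct, for illustration only.
PRIORITISATION = Construct(
    name="prioritisation under constraint",
    template="On {project}, what did you decide not to do, and how did you make that call?",
    rubric_id="prioritisation-v1",
)

for project in ("the billing migration", "the warehouse rollout"):
    print(render_question(PRIORITISATION, project), "->", PRIORITISATION.rubric_id)
```

Two candidates hear different words, but both answers are scored against `prioritisation-v1`, so a leaked question is of little use to the next candidate while the measurement stays comparable.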
The bias problem, properly stated
The argument that AI interviews are inherently fairer than human interviews is half-right and dangerous if taken without qualification.
It is right that AI interviewers don’t form first-impression biases based on a candidate’s photograph, accent, name, or appearance — provided you have architected the system to be blind to those features, which most off-the-shelf voice-interview platforms have not. They don’t get bored on Friday afternoons. They don’t favour candidates from their alma mater. They don’t apply tougher scrutiny to candidates whose vocabulary differs from their own.
It is wrong that this makes them automatically unbiased. Three problems:
Training-data bias. An LLM trained on internet text has absorbed the stereotypes that text contains. If you ask it to score “leadership” on a candidate transcript, it will tend to score candidates who use stereotypically confident vocabulary higher than candidates who use hedge words, even when the underlying behaviour described is identical. Hedge-word use is correlated with gender, native-language status, and cultural background, so bias gets smuggled in through vocabulary correlations the model picked up in training.
Construct-definition bias. The dimensions you choose to score, and the anchors you choose for each level of each dimension, embed cultural assumptions. A “strong” communicator might be defined as “speaks confidently with direct claims,” which is a culturally specific style. The Schön and Polanyi tradition in tacit-knowledge research is explicit that what counts as competent practice is partly culturally constituted. A scoring rubric that doesn’t engage with this will systematically prefer candidates from the culture the rubric was written in.
Audit-blindness. A human interviewer’s bias is visible in their notes and challengeable by other reviewers. A model’s bias is encoded in latent weights you cannot inspect. The only way to surface it is to systematically audit outputs across demographic subgroups — and most companies running AI interviews don’t do this audit.
The right design for an AI interview programme makes all three of these problems explicit. The training-data bias gets mitigated by anchoring scores in candidate-specific evidence rather than vocabulary patterns (the “no score above 3 without a verbatim quote” rule we use is a direct implementation of this). The construct-definition bias gets mitigated by drawing the rubric anchors from multiple cultural traditions and running diverse-rater calibration sessions before going live. The audit-blindness gets mitigated by running quarterly subgroup audits and adjusting if disparate impact is detected.
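The “no score above 3 without a verbatim quote” rule is easy to enforce mechanically. The sketch below shows one way to check it, assuming a 1-to-5 scale and a plain-text transcript; the `DimensionScore` class and `check_evidence` function are hypothetical names for illustration.

```python
from dataclasses import dataclass

QUOTE_REQUIRED_ABOVE = 3  # the threshold from the rule quoted above


@dataclass(frozen=True)
class DimensionScore:
    dimension: str
    score: int
    quotes: tuple[str, ...] = ()


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks don't hide a match."""
    return " ".join(text.lower().split())


def check_evidence(item: DimensionScore, transcript: str) -> list[str]:
    """Return a list of problems; an empty list means the score is grounded."""
    problems = []
    haystack = _normalise(transcript)
    quotes = [q for q in item.quotes if q.strip()]
    if item.score > QUOTE_REQUIRED_ABOVE and not quotes:
        problems.append(f"{item.dimension}: score {item.score} has no supporting quote")
    for quote in quotes:
        # A quote only counts if the candidate actually said it.
        if _normalise(quote) not in haystack:
            problems.append(f"{item.dimension}: quote not found in transcript: {quote!r}")
    return problems


transcript = "I noticed the alert myself and\nrolled back the deploy before anyone paged me."
grounded = DimensionScore("ownership", 4, ("I noticed the alert myself and rolled back the deploy",))
ungrounded = DimensionScore("ownership", 5)
print(check_evidence(grounded, transcript))
print(check_evidence(ungrounded, transcript))
```

A check like this only confirms that the quote exists in the transcript. Whether the quote actually supports the score is still a judgment for the rater or the human reviewer.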
UK GDPR Article 22 and the right to human review
For European and UK hiring teams, there is a legal layer that bears on this directly. Article 22 of the UK and EU General Data Protection Regulation grants every data subject (including candidates) the right not to be subject to a decision based solely on automated processing — including profiling — which produces legal effects or similarly significant effects. A hiring decision counts as a similarly significant effect in regulator guidance.
In practice this means three things. First, a hiring decision based solely on an AI interview score, with no human review, is prima facie unlawful under Article 22. Second, the candidate must be informed in advance that automated processing is being used, given meaningful information about the logic involved, and offered an opportunity to contest the decision and request human review. Third, the controller (in most cases the hiring company, not the AI vendor) is responsible for ensuring the necessary human review actually happens.
The legitimate operating model is not “the AI screens, the human rubber-stamps.” The legitimate operating model is “the AI produces evidence; the human reviewer makes the actual decision using that evidence as input among others.” This is also the design that produces better hires, because the AI is good at applying structure consistently and the human is good at integrating signal across the full picture. The two-stage approach we use at Basanite — the AI produces a quote-grounded report; the hiring manager runs the final decision — is built for this regulatory frame, but it is also better hiring.
Companies deploying AI interview tools in 2026 should be explicit about which interviews are governed by Article 22 (most of them) and what the documented human-review step is. The candidate-facing consent flow should mention the right, the self-serve form for requesting human review should be easy to find, and the internal process should ensure the review actually happens when triggered. We have documented our specific implementation in our FAQ and at basanite.co.uk/data-rights.
The fairness frame to actually use
Bringing this together, the fairness frame that we think holds up under both empirical scrutiny and legal scrutiny rests on a few principles.
Structure beats unstructure. If you have a choice between a more-structured and a less-structured interview design, choose more structure. The empirical evidence here is settled. Any process that runs mostly unstructured interviews is leaving predictive validity on the table.
Structure should be construct-level, not stimulus-level. The construct (what you’re measuring) and the rubric (how you’re measuring it) should be identical across candidates. The specific stimulus (the words of the question) can and should vary so that leaked answer keys and rehearsed responses don’t contaminate the signal. This is the principle we covered in the CV-grounded interviews piece.
Scoring should be evidence-grounded. Every score above the rubric midpoint should be backed by a specific verbatim quote or specific observed behaviour. This forces both human and AI raters to anchor their judgment in things the candidate actually did, not in overall impressions. It also creates an audit trail.
Audits should be subgroup-sensitive. Any production AI interview programme should run quarterly subgroup audits — score distributions and hire rates by gender, ethnicity, and other protected characteristics where data is collected lawfully. Disparate impact in outcomes is a signal that something in the design or the model is biased even if individual decisions look defensible.
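As a rough illustration of what one line of such an audit computes, the sketch below derives advance rates per subgroup and flags any subgroup whose rate falls well below the best-performing one. The 0.8 threshold is an assumption borrowed from the US “four-fifths” rule of thumb and is not a UK or EU legal standard; the function names and sample data are hypothetical.

```python
from collections import Counter

# Illustrative threshold only: 0.8 mirrors the US "four-fifths" rule of thumb,
# not a UK or EU legal standard.
IMPACT_RATIO_THRESHOLD = 0.8


def selection_rates(outcomes):
    """Advance rate per subgroup from (subgroup, advanced) pairs."""
    totals, advanced = Counter(), Counter()
    for group, passed in outcomes:
        totals[group] += 1
        advanced[group] += int(passed)
    return {group: advanced[group] / totals[group] for group in totals}


def impact_ratios(rates):
    """Each subgroup's rate divided by the highest subgroup rate."""
    best = max(rates.values())
    if best == 0:
        raise ValueError("no candidates advanced in any subgroup")
    return {group: rate / best for group, rate in rates.items()}


def flag_subgroups(outcomes, threshold=IMPACT_RATIO_THRESHOLD):
    """Subgroups whose impact ratio falls below the threshold."""
    ratios = impact_ratios(selection_rates(outcomes))
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Synthetic data for illustration, not real audit results.
sample = (
    [("A", True)] * 5 + [("A", False)] * 5
    + [("B", True)] * 3 + [("B", False)] * 7
)
print(flag_subgroups(sample))
```

Here subgroup B advances at 30% against 50% for subgroup A, a ratio of 0.6, so it is flagged. With small samples a ratio like this is noisy, so a real audit would pair it with sample sizes and some measure of statistical uncertainty before drawing conclusions.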
Human review should be real. A rubber-stamp human review does not meet the spirit of GDPR Article 22, and regulator guidance suggests it may not meet the letter either, because token human involvement can still count as solely automated processing. The human reviewer should have enough information to overturn the AI’s implicit recommendation, and overturning should actually happen with non-negligible frequency. If your human reviewer agrees with the AI 100% of the time, the review is not adding signal.
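One simple way to monitor this is to log the AI’s recommendation next to the human decision and track how often they differ. The sketch below assumes such a log exists as pairs of labels; the `review_summary` function and the label values are hypothetical.

```python
from collections import Counter


def review_summary(decisions):
    """Summarise (ai_recommendation, human_decision) pairs."""
    pairs = list(decisions)
    if not pairs:
        raise ValueError("no reviewed decisions to summarise")
    # Count only the cases where the human departed from the AI, by direction.
    overrides = Counter((ai, human) for ai, human in pairs if ai != human)
    return {
        "reviewed": len(pairs),
        "override_rate": sum(overrides.values()) / len(pairs),
        "overrides_by_direction": dict(overrides),
    }


# Synthetic log for illustration only.
log = [("advance", "advance")] * 8 + [("advance", "reject"), ("reject", "advance")]
print(review_summary(log))
```

An override rate of zero over a long period is the warning sign described above. Splitting overrides by direction also shows whether reviewers only ever reject AI-recommended candidates or also rescue candidates the AI scored low.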
Where the field is going
Two predictions about where structured interviewing ends up in the next 24 months.
First, the unstructured-vs-structured argument gets resolved in favour of structure, because AI interviewers make structure cheap. Companies that wouldn’t pay for human-trained structured-interview panels — most companies, frankly — can deploy AI panels that hold structure at scale. The predictive-validity floor for an average hiring process rises.
Second, the bias-and-fairness conversation gets more technically sophisticated. The current public conversation about AI hiring bias is largely vibes-based. Within 24 months, regulator guidance will likely require specific subgroup audit reporting from any AI hiring tool deployed at scale, and there will be a small ecosystem of consultancies that do nothing but run those audits. Vendors who built audit infrastructure from day one will benefit; vendors who didn’t will scramble.
For hiring teams making procurement decisions now, the question to ask vendors is concrete: show me your subgroup audit data for the last four quarters, broken out by role family and seniority band. Any vendor that can’t produce this is operating on faith. Any vendor that can has done the engineering work to make their tool legitimately deployable in a 2026 European regulatory environment.
If you want to dig deeper into the design considerations: CV-grounded interviews covers the stimulus-vs-construct distinction, AI-resistant coding interview design covers the round-two design, and hiring engineers who can use AI covers what to do once you have structure in place.