AI Bias in Resume Screening: Why LLMs Favor Resumes Written by AI (2025 Study)
Updated June 5, 2026
By Bogdan
In short
A 2025 study presented at ACM EAAMO and AIES (Xu, Li & Jiang, arXiv:2509.00462) tested seven major LLMs (GPT-4o, GPT-4-turbo, GPT-4o-mini, LLaMA 3.3-70B, Mistral-7B, Qwen-2.5-72B, and DeepSeek-V3) and found that AI screeners systematically prefer resumes written by their own model over human-written or competitor-AI versions with the same content. GPT-4o picked its own resume 82% of the time, LLaMA 3.3-70B 79%, and DeepSeek-V3 72%; GPT-4-turbo and Qwen-2.5-72B both came in above 65%, while Mistral-7B was the outlier at 28%. In simulated hiring across 24 occupations, candidates who used the same LLM as the recruiter's screener were 23–60% more likely to be shortlisted than equally qualified candidates with human-written resumes, with the largest gaps in business roles like sales and accounting. The cause: when LLMs rate text, low-perplexity (familiar-to-the-model) text scores higher regardless of content. Practical takeaway: write the substance of your CV yourself, use AI only to polish wording, never paste an AI-generated CV verbatim if you don't know what screening model the employer uses, and disclose AI assistance if the application asks for it.
What the study actually found
In September 2025, three researchers — Jiannan Xu (University of Maryland), Gujie Li (Cornell), and Jane Yi Jiang — released a paper titled "AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights" (arXiv:2509.00462; presented at ACM EAAMO 2025 and AIES 2025). It's the first large-scale empirical test of a question the AI fairness community had been circling for two years: when LLMs evaluate text, do they secretly prefer text that sounds like their own writing?
The setup was clean. They took 2,245 anonymized real resumes from LiveCareer.com, spanning 24 occupational categories. For each resume, they generated AI rewrites using seven LLMs — three commercial (GPT-4o, GPT-4-turbo, GPT-4o-mini) and four open-source (LLaMA 3.3-70B, Mistral-7B, Qwen-2.5-72B, DeepSeek-V3). Then they asked each model to evaluate pairs of resumes (one written by itself, one by a human or competitor model) and pick the stronger candidate. Content quality was controlled for — same role, same experience, same achievements — only the prose surface differed.
The result: most models preferred their own writing, often dramatically, and the effect was strongest in the larger models. Per-model self-preference rates against human-written resumes:
- GPT-4o: 82% (picked its own resume roughly 4 out of 5 times)
- LLaMA 3.3-70B: 79%
- DeepSeek-V3: 72%
- GPT-4-turbo and Qwen-2.5-72B: both above 65%
- Mistral-7B: 28% (one of the few near-neutral results)
- LLaMA 3.2-3B (one of the smallest tested): 11.6% (the smallest models showed almost no bias, suggesting the effect scales with model capability)
Critically, the bias survived robustness checks. Matching the resume pairs for semantic content via BERTScore and ROUGE-L, or for writing style via LIWC linguistic features, didn't make it go away. The preference wasn't tracking quality or style — it was tracking whether the text sounded like the evaluating model's own outputs.
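If the percentages are read as pairwise pick rates, as the GPT-4o figure is described above, the tally itself is simple. The sketch below is an illustration only: the trial format, the model names, and the counts are made up, and this is not the study's code or data.

```python
from collections import defaultdict

def self_preference_rates(trials):
    """Share of pairwise trials in which each evaluator picked its own rewrite.

    `trials` is an iterable of (evaluator_name, picked_own) pairs, a
    hypothetical record format chosen for this example.
    """
    picked = defaultdict(int)
    total = defaultdict(int)
    for evaluator, picked_own in trials:
        total[evaluator] += 1
        picked[evaluator] += int(picked_own)
    return {model: picked[model] / total[model] for model in total}

# Toy data: model_a picks its own rewrite in 41 of 50 pairs, model_b in 25 of 50.
trials = (
    [("model_a", True)] * 41 + [("model_a", False)] * 9
    + [("model_b", True)] * 25 + [("model_b", False)] * 25
)
rates = self_preference_rates(trials)
print(rates)  # {'model_a': 0.82, 'model_b': 0.5}
```

A rate of 0.5 would mean the evaluator is indifferent between its own rewrite and the alternative; the further above 0.5, the stronger the self-preference.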
Why this happens — the perplexity mechanism
Perplexity measures how surprised a language model is by a piece of text: roughly, how improbable it finds each next token. Text the model would plausibly have generated itself has low perplexity (predictable, familiar); text written in an unfamiliar style or with rare phrasings has high perplexity. The Xu/Li/Jiang study found that when LLMs are asked to rate text quality, they assign significantly higher scores to lower-perplexity passages, even when the actual content quality is identical.
The simplest way to think about it: an LLM evaluator isn't really judging "is this a good resume?" It's running an implicit calculation of "does this resume look like the kind of thing I would have written?", and then projecting that fluency-recognition onto quality. Familiarity → fluency → judged as higher quality. Lower perplexity tends to beat higher perplexity, even when both candidates are objectively equally good.
This is similar to how humans tend to judge writing that mirrors our own style as "clearer" or "more professional", but in LLMs it operates much faster and with no introspection. The researchers note this is not a bug in any specific model. It's a systemic property of how transformer-based language models score text. Every major LLM exhibits some version of it.
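To make the perplexity idea concrete, here is a minimal sketch using the standard definition: the exponential of the negative mean log-probability per token. The token probabilities below are invented for illustration; in practice they would come from the evaluating model, and the article does not describe exactly how the study computed perplexity.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical passages: the model assigns each token probability 0.5
# in the "familiar" text and 0.1 in the "unfamiliar" one.
familiar = [math.log(0.5)] * 4
unfamiliar = [math.log(0.1)] * 4

print(perplexity(familiar))    # approximately 2.0
print(perplexity(unfamiliar))  # approximately 10.0
```

Lower values mean the text is closer to what the model itself would have produced, which is the property the study links to higher quality ratings.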
The real-world impact: 23-60% more likely to be shortlisted
Knowing that one LLM picks its own writing 82% of the time in a controlled pairwise test is academically interesting. The real question is: does this matter when you're competing against a stack of other candidates for an actual job?
To find out, the authors ran simulated hiring pipelines across all 24 occupational categories. In each round, a pool of candidates was assembled: some had their resumes refined by GPT-4o (the recruiter's screening model), others used different LLMs, and others kept their original human-written resumes. The screening LLM then ranked the pool and produced a shortlist.
Candidates whose resumes had been refined by the same LLM as the screener (GPT-4o → GPT-4o) were 23-60% more likely to make the shortlist than equally qualified candidates with human-written resumes. The largest disadvantages — closer to the 60% end — appeared in business-oriented occupations: sales, accounting, marketing, customer service. The smallest gaps appeared in technical roles where resume content (specific tools, languages, certifications) outweighs prose style.
Translate that into a job-search reality: if a sales role gets 200 applicants and the recruiter's ATS pre-filters with GPT-4o, an applicant who polished their resume with GPT-4o could, going by the study's simulations, be up to 60% more likely to reach the human-review pile than an applicant of identical caliber whose resume was written without AI. That's not a small edge.
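A quick back-of-the-envelope calculation shows what that range means in practice. The sketch assumes "23-60% more likely" is a relative lift on the shortlist probability, and it uses a 10% baseline shortlist rate that is purely hypothetical (the article gives no baseline).

```python
def lifted_rate(base_rate, relative_lift):
    """Shortlist probability after a relative lift, capped at 100%."""
    return min(1.0, base_rate * (1 + relative_lift))

base = 0.10  # hypothetical baseline: 1 in 10 human-written resumes is shortlisted

for lift in (0.23, 0.60):
    print(f"{lift:.0%} lift: {base:.1%} -> {lifted_rate(base, lift):.1%}")
# 23% lift: 10.0% -> 12.3%
# 60% lift: 10.0% -> 16.0%
```

Under those assumptions, the same candidate goes from a 1-in-10 chance to somewhere between roughly 1-in-8 and 1-in-6, purely from matching the screener's model.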
Why this is a bigger deal than "yet another AI bias"
AI fairness research has spent a decade documenting bias against demographic groups — gender, race, age, disability. Those biases are well understood and actively legislated against in the EU (AI Act high-risk classification of hiring AI) and several US jurisdictions (NYC Local Law 144, Illinois AI Video Interview Act, etc.).
Self-preference bias is different. It doesn't track a protected characteristic; it tracks whether you used the same AI brand as your employer. That sounds harmless until you notice that the LLM market is concentrated. OpenAI's GPT-4o is widely deployed for corporate screening and is also one of the most widely used consumer LLMs. The bias can therefore systematically favor candidates with paid ChatGPT subscriptions over candidates using free Claude, Gemini, or DeepSeek, and over candidates writing without AI at all.
That's a wealth- and access-coupled bias hiding inside what looks like a neutral algorithmic process. Unlike demographic bias, no existing regulation addresses it. The authors explicitly call for expanded fairness frameworks to cover "AI-AI interactions" — biases that emerge not from how an algorithm treats a person, but from how an algorithm treats another algorithm's output. That category of harm is brand new in the policy literature.
What this means for you, the job seeker
Don't panic. The bias is real but the rational response isn't "never use AI" — it's "use AI in a way that doesn't make your CV trivially identifiable as one model's output". Five practical implications:
- Write the substance yourself. Specific achievements, numbers, dates, project names, the actual content of your role — these are facts the AI can't invent without hallucinating. Putting them down yourself is non-negotiable both for honesty and to prevent the obvious giveaways (made-up companies, suspiciously round percentages).
- Use AI to polish, not to draft. Asking an LLM to "rewrite my bullet to sound more impactful" gives you a wording improvement on top of your real content. Asking it to "write me a resume for a marketing manager" gives you generic AI prose that screams its origin to every screener — and to any human reviewer with two months of experience.
- Don't paste a single-model output verbatim. Unedited prose from one LLM only benefits from self-preference if the screener happens to be that same model; against any other screener, or a human reader, it just reads as generic AI text. Mix sources: draft yourself, polish with one tool, run a final pass with a different tool, edit the result by hand.
- Don't sound robotic. The bias rewards familiar-to-the-model phrasing — but a human reviewer with experience can spot "As a results-driven professional with a passion for excellence" from across the room. The compromise is voice: keep your own sentence rhythm and word choice, use AI for grammar/concision tweaks only.
- Be honest if asked. A small but growing number of EU employers, especially in regulated sectors like finance, healthcare, and public administration, explicitly ask candidates to disclose AI assistance during applications. If a role's instructions ask, disclose. Lying about it can be grounds for rescinding an offer later.
What this means for employers and recruiters
If you're on the other side of the table, the paper has uncomfortable implications. Single-model resume screening is no longer defensible as "objective". If your ATS uses GPT-4o to rank applicants and you're hiring for a sales role, you're systematically advantaging candidates who paid for the same model — and that advantage has nothing to do with whether they can sell.
The Xu/Li/Jiang study tested two mitigations that worked:
- Prompt-level instruction. Adding a sentence to the screening prompt ("You should not consider or infer whether resumes were written by a human or by AI. Focus only on content quality.") measurably reduced bias. Cheap, but only a partial fix.
- Majority-vote ensemble. Combining the primary evaluator with two smaller models (e.g. GPT-4o plus LLaMA 3.2-1B and LLaMA 3.2-3B) and using majority vote for the final shortlist decision reduced bias by more than 50%. Smaller models showed near-zero self-preference, so they served as a counterweight to the primary model's bias. More compute, much more fairness.
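The two mitigations can be combined in a few lines. The sketch below is a rough illustration, not the study's implementation: the evaluator functions are stand-in stubs (real ones would call the respective models), the base prompt wording is hypothetical, and only the neutrality sentence is taken from the text above.

```python
from collections import Counter

# Sentence quoted above; everything else in the prompt is a hypothetical example.
NEUTRALITY_INSTRUCTION = (
    "You should not consider or infer whether resumes were written by a human "
    "or by AI. Focus only on content quality."
)
BASE_PROMPT = "Compare resume A and resume B for this role. Answer 'A' or 'B'."
prompt = f"{BASE_PROMPT} {NEUTRALITY_INSTRUCTION}"

def majority_vote(evaluators, prompt, resume_a, resume_b):
    """Return 'A' or 'B' according to the majority of evaluator votes."""
    if len(evaluators) % 2 == 0:
        raise ValueError("use an odd number of evaluators to avoid ties")
    votes = Counter()
    for evaluate in evaluators:
        choice = evaluate(prompt, resume_a, resume_b)
        if choice not in ("A", "B"):
            raise ValueError(f"unexpected vote: {choice!r}")
        votes[choice] += 1
    return votes.most_common(1)[0][0]

# Stub evaluators standing in for a primary model and two smaller models.
def primary_model(prompt, resume_a, resume_b):
    return "A"  # pretend the primary model favors its own style

def small_model_1(prompt, resume_a, resume_b):
    return "B"

def small_model_2(prompt, resume_a, resume_b):
    return "B"

decision = majority_vote(
    [primary_model, small_model_1, small_model_2],
    prompt, "resume A text", "resume B text",
)
print(decision)  # B
```

The design point is that the primary model can be outvoted: when the two smaller evaluators agree with each other, their choice wins even if the primary model prefers its own style.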
For EU employers specifically: the AI Act classifies hiring AI as high-risk. Single-model screening introduces a measurable, non-demographic bias that disadvantages candidates with less access to the same paid AI service, and that likely intersects with the Act's transparency and fairness obligations. Document your mitigation strategy before you need it.
The honest version: should you use AI on your CV at all?
Three honest scenarios. Pick the one that matches your reality.
Scenario A: you're applying to a large corporate that probably ATS-screens with GPT-4o. The 23-60% shortlist edge is real here. Using a major LLM to polish your CV (NOT to draft it) gives you the edge without the obvious tells. Net: use AI carefully.
Scenario B: you're applying to a small/medium company where a human recruiter reads the CV first. The screening LLM doesn't exist in this pipeline. Robotic AI prose actively hurts you here — humans can spot it, and many actively penalize it as a perceived lack of effort. Net: write it yourself, use AI sparingly for grammar.
Scenario C: you're applying to academia, healthcare, or public administration in the EU. AI disclosure is increasingly required. AI-fluent prose can read as a red flag rather than a green one. Net: minimal AI, prefer none, disclose if assistance was used.
Across all three scenarios, one rule is universal: don't let AI write your factual content. The achievements, numbers, and project specifics have to come from you. AI is a wording polish, not a substance generator.
How we approach this at TakeMeUp.cv
Full disclosure: we build a CV tool and we ship AI features. So this is the awkward section where we have to be transparent about our own product in an article documenting bias in AI tools.
Our AI Rewrite add-on is deliberately scoped to wording-level polishing, not bullet generation. It rewrites a bullet you wrote into a stronger version of itself — keeping your numbers, your dates, your project names, and the substance of what you actually did. It refuses to invent metrics. That's not virtue signaling; it's the only product position we can defend in a world where AI-generated CV fabrication is rampant. Our Authenticity Score add-on exists precisely because we know recruiters are starting to spot single-LLM prose.
Caveats and what we don't yet know
The Xu/Li/Jiang study is the strongest evidence we have, but a few honest caveats before you over-update on it:
- The resumes were US-context (LiveCareer.com). EU-context CVs include photos, dates of birth, GDPR-relevant fields, and locale-specific section orderings. The bias mechanism (perplexity-as-familiarity) should generalize, but the size of the effect for European hiring is not yet measured.
- The screening tasks tested were pairwise comparisons and shortlist ranking. Production ATS systems often combine LLM scoring with keyword filters, knockout questions, and weighted criteria — the LLM bias is one signal in a stack of signals.
- The study didn't test Claude (Anthropic) or Gemini (Google) — both of which are now used in production screening at scale. The direction of bias should be the same (models prefer their own outputs), but the magnitudes for those specific systems aren't in this dataset.
- Self-preference bias is one bias among many that LLMs exhibit when screening resumes. Demographic bias against women, minorities, and older candidates persists in many models — that issue is older and better documented, and it doesn't go away because we now have a new AI-vs-AI problem.
- The mitigation (majority-vote with small models) reduces bias by >50% but doesn't eliminate it. There is no current technique that removes self-preference bias entirely.
Use AI on your CV without falling into the self-preference trap (6 steps)
- Step 1: Draft your own substance first. Open a blank doc and write down the facts: roles, dates, employer names, project names, three to five real achievements per role with numbers if you have them. Do this BEFORE you open any AI tool. The factual layer has to come from your memory, not a model's guess.
- Step 2: Use AI for wording, not for content. Paste one bullet at a time into your chosen LLM and ask: "Rewrite this bullet to be more concise and impactful, keeping all factual content intact." Reject any output that adds a number, metric, or claim you didn't supply. If the AI invents things, switch tools.
- Step 3: Mix sources to dilute single-model fingerprints. If you used ChatGPT to polish the experience section, run the education section through a different tool (Claude, Gemini, DeepSeek), or just edit by hand. The aim is to keep the whole CV from reading like one model's signature output.
- Step 4: Edit the AI output by hand. Read every AI-suggested sentence aloud. If it sounds robotic, swap one or two words to match your natural voice. Replace any "results-driven", "passion for excellence", "synergize", or "leverage" with the words you'd actually use. Voice survives the polish.
- Step 5: Run an ATS check before sending. Whatever AI you used, the structural ATS check (single column, real selectable text, standard section labels, no decorative photos in header) still matters. ATS keyword filters and section parsing run regardless of LLM screening, so your file has to clear both layers.
- Step 6: Disclose if the application asks. A growing number of EU employers, especially in finance, healthcare, public administration, and academia, explicitly ask whether AI was used. If the application asks, answer honestly. Saying "yes, AI was used to polish wording, all facts and achievements were authored by me" is a defensible and increasingly expected answer.
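Two of the checks in these steps are easy to automate: flagging numbers the AI added that weren't in your original bullet, and flagging the stock phrases named above. The sketch below is a rough helper under simple assumptions: the number pattern is deliberately crude, the phrase list is just the four examples from this article, and the sample bullets are invented. Treat anything it flags as a prompt for manual review, not a verdict.

```python
import re

# The four stock phrases mentioned in this article; extend as needed.
STOCK_PHRASES = ("results-driven", "passion for excellence", "synergize", "leverage")

# Crude pattern: digits with optional thousands/decimal separators and a trailing %.
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*%?")

def invented_numbers(original, rewritten):
    """Numbers that appear in the rewrite but not in the original bullet."""
    original_numbers = set(NUMBER_PATTERN.findall(original))
    return [n for n in NUMBER_PATTERN.findall(rewritten) if n not in original_numbers]

def stock_phrases_found(text):
    """Stock phrases present in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [phrase for phrase in STOCK_PHRASES if phrase in lowered]

# Invented example bullets.
original = "Grew newsletter signups from 1,200 to 1,950 in 6 months."
rewritten = (
    "Results-driven marketer who grew newsletter signups from 1,200 to 1,950 "
    "in 6 months, boosting revenue 30%."
)

print(invented_numbers(original, rewritten))  # ['30%']
print(stock_phrases_found(rewritten))         # ['results-driven']
```

Here the rewrite would be rejected on both counts: the 30% revenue figure was never supplied, and "results-driven" is exactly the kind of phrase a human reviewer notices.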
Frequently asked questions
Should I use ChatGPT or other LLMs to write my CV in 2026?
Use them to polish wording, not to write your factual content. The Xu/Li/Jiang (2025) study shows that LLM screeners give a 23-60% shortlist edge to candidates who used the same model as the screener — but only when the writing actually reads like that model's output. A CV where you wrote the substance and an AI polished individual bullets gets most of the edge without sounding robotic to human reviewers.
Does this mean I should use the same AI tool the employer uses?
If you knew which model the employer's ATS uses, matching it would maximize the bias in your favor. In practice you almost never know. GPT-4o is likely among the most widely deployed corporate screening models, so polishing with GPT-4o is a reasonable bet for large corporate applications, but only as a polish, not as a draft. For small companies with human reviewers, the safer play is minimal AI usage.
Are recruiters and employers aware of this bias?
Increasingly yes, especially in EU companies preparing for the AI Act's high-risk hiring classification. Some are mitigating with majority-vote ensembles (combining a primary LLM with smaller models for shortlist decisions) which cuts the bias by over 50%. Most smaller companies using off-the-shelf ATS products are not aware and have no mitigation in place.
Can a human recruiter spot AI-written resume prose?
Experienced recruiters can often spot single-model AI prose within seconds of reading. The tells: stock phrases like "results-driven", "passion for excellence", "synergize", "leverage"; suspiciously uniform sentence length; over-symmetrical bullet structure; vague achievements with no real numbers. AI polish that keeps your sentence rhythm and replaces stock phrases with your own words mostly avoids detection.
Is using AI on my CV dishonest?
Using AI to polish wording is industry-standard practice and not dishonest in itself. Using AI to fabricate achievements, invent metrics, or claim experience you don't have IS dishonest and is grounds for rescinding any offer that results. The line is between wording (acceptable) and substance (not acceptable). Most ethical guidelines now in development across EU institutions follow this same line.
What's the difference between self-preference bias and demographic bias in hiring AI?
Demographic bias means the AI treats candidates differently based on protected characteristics like gender, race, or age — well documented since 2018, actively legislated against in the EU AI Act and several US jurisdictions. Self-preference bias means the AI treats candidates differently based on whether they used the same AI brand as the screener — first measured at scale by Xu, Li & Jiang (2025), and not currently addressed by any AI fairness regulation. Both biases can coexist in the same screening system.
Where can I read the original research?
The paper is "AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights" by Jiannan Xu, Gujie Li, and Jane Yi Jiang. The preprint is openly available at arXiv:2509.00462 (2025). Non-archival versions were presented at ACM EAAMO 2025 and AIES 2025 (DOI 10.1145/3757887.3767676). The arXiv version is updated more frequently and is the recommended primary source.
Will this bias get worse over time?
Two opposing forces. Worse: as more candidates use AI to write CVs, the proportion of AI-fluent text in the screening pool grows, and the bias against the shrinking human-written minority becomes more pronounced. Better: as researchers publish more findings like this one, fairness-aware ATS vendors and EU regulators are catching up. Net direction over the next 2-3 years is uncertain. The safe bet is to assume the bias persists and use the polish-not-draft strategy regardless.