Trustworthy AI for high-stakes work — starting with evidence
Trustworthy AI for high-stakes evidence review.
Evidence Synthesis AI screens studies the way a careful expert would — deciding confidently where the call is clear, flagging the genuine judgment calls for a human, and logging its reasoning for every decision. Built for the teams who can’t afford a confident mistake: systematic reviewers, and the drug-safety groups monitoring literature for adverse events.
~80 systematic reviews published every day
>1 yr for the average review, from registration to publication
8× swing in a model’s confident-error rate between reviews
01 · The problem
Screening is the bottleneck. It’s also the trap.
Title-and-abstract screening is the single largest time sink in evidence work. The obvious fix is to automate it — but a naive AI screener makes high-stakes review worse, not faster. An overconfident one corrupts every downstream step. An over-cautious one wipes out the time savings that justified using AI at all. The question was never whether AI can screen. It’s whether you can trust how it decides — and prove it afterwards.
02 · How it works
Decisive where it’s sure. Careful where it isn’t.
1. Handles the confident calls on its own
Clear includes and excludes are decided automatically where the system is well calibrated.
2. Surfaces only the genuine judgment calls
Ambiguous studies are routed to a human, so reviewer attention goes where it actually matters.
3. Shows the reasoning behind every decision
Each include, exclude, or deferral comes with the rationale the system used to get there.
4. Override anything, audit everything
Reviewers can change any decision, and the full log is the trail regulators expect.
03 · Why you can trust it
Built on deference-aware evaluation.
Most AI metrics reward confident answers and penalise hesitation — the wrong incentive when overconfidence carries real cost. Deference-aware evaluation measures whether a system recognises the limits of its own competence and steps back when it should. It credits considered deferral as correct, separates it from genuine confident error, and surfaces a class of failures that more data and bigger models won’t fix.
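To make the distinction concrete, here is a minimal sketch of how deferral and confident error might be counted separately. It is not the scoring defined in the white paper: the function name, the metric names, and the simplification that every deferral earns credit (rather than only well-judged ones) are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def deference_aware_summary(predictions: Sequence[str], gold: Sequence[str]) -> Dict[str, float]:
    """Score screening decisions, keeping deferrals apart from confident errors.

    predictions: "include", "exclude", or "defer" for each study.
    gold: the reference "include" or "exclude" label for each study.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one decision is required")

    counts = Counter()
    for pred, truth in zip(predictions, gold):
        if pred == "defer":
            counts["deferred"] += 1          # stepped back instead of guessing
        elif pred == truth:
            counts["correct"] += 1
        else:
            counts["confident_error"] += 1   # committed to the wrong answer

    n = len(gold)
    return {
        # Conventional view: a deferral scores the same as a wrong answer.
        "conventional_accuracy": counts["correct"] / n,
        # Simplified deference-aware view: a deferral is credited, not penalised.
        "deference_aware_score": (counts["correct"] + counts["deferred"]) / n,
        "confident_error_rate": counts["confident_error"] / n,
        "deferral_rate": counts["deferred"] / n,
    }


summary = deference_aware_summary(
    ["include", "defer", "exclude", "include"],
    ["include", "include", "exclude", "exclude"],
)
```

In this toy example the conventional accuracy is 0.5, because the one deferral is scored as a miss, while the confident-error rate is 0.25. Reporting the two numbers side by side is what keeps cautious behaviour from being mistaken for failure.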
Validation
Validated across 6 frontier models and 5 medical domains — 2,729 studies, 16,374 screening decisions.
6 frontier models
5 medical domains
2,729 studies
16,374 screening decisions
04 · Who it’s for
The teams who can’t afford a confident mistake.
Systematic review teams
Title-and-abstract screening without the year-long grind, with an audit trail that holds up to peer scrutiny.
Pharmacovigilance
Continuous literature monitoring for adverse events, with the documented decision trail regulators expect.
Research consultancies
Evidence work at speed, with rigour you can defend to a client or regulator.
05 · Research
White Paper · 2026 · Hopperlace Research · DOI: 10.17605/OSF.IO/A69YH
Poster · Workshop on Technical AI Governance Research (TAIGR), ICML 2026
Deference-Aware Evaluation for Human-in-the-Loop AI Systems
A framework for evaluating AI systems on their capacity to recognise the limits of their own competence and defer when appropriate, alongside standard accuracy. The paper identifies two failure modes that conventional metrics conflate — penalised conservatism and genuine confident errors — and introduces deference-aware metrics that distinguish them. A cross-domain audit of six frontier models across five medical domains (2,729 studies, 16,374 screening decisions) shows that no single model is uniformly safe, and isolates a structural class of failures that calibration, ensembling, and model scaling cannot fix.
Read on OSF
“A model’s confident-error rate can swing more than eightfold from one review to the next, which is why screening needs an evaluation layer that knows when its own judgments can be trusted.”
06 · Team
Who we are
Martin Walker, MPH
Co-founder, Evidence Synthesis
Background in evidence-based health and systematic-review evidence synthesis; brings the domain experience that keeps the system honest about clinical reality.
Yuyu Shen
Founder
A decade building production AI across Meta, Walmart, Beamery, and Cleo; founded Hopperlace to close a gap that kept reappearing — AI deployed in high-stakes work without the means to know when its outputs can be trusted.
07 · The bigger picture
Behind the product is one conviction: you should be able to see who and what stands behind an AI system before you rely on it — how it behaves, and who makes and backs it.
Independent · public-interest
That principle is why we build evidence tools that prove their own trustworthiness. It’s also why we’re building Value Compass, an independent project that brings together what’s known about who makes and funds AI tools, and the values they operate by, so people can weigh that alongside whether a tool does the job.
Contact
Get in touch
Running a systematic review or pharmacovigilance team? We’re onboarding early pilots.
hello@hopperlace.ai