CASE STUDY

How RLDatix's platform team used Confident AI to standardize evals across the enterprise

We hit a point where every AI team was building their own eval stack. That's fine for one product. With five, ten, fifteen AI initiatives across the portfolio, it's never going to live up to our high standards of AI governance.

Richard Jarvis, Chief Technology Officer

THE COMPANY

RLDatix is the operating layer behind safer healthcare

RLDatix is the software backbone of healthcare safety across the NHS and beyond. Its products are deployed in nearly every NHS Trust in the United Kingdom, and across thousands of hospitals in North America, Europe, and Asia-Pacific. They power workforce management, risk and incident reporting, regulatory compliance, and policy management — the systems behind making sure the right patient gets the right care, from the right clinician, at the right time.

RLDatix sits in the workflow whenever a nurse reports an adverse event, a manager rosters a ward, or a hospital answers to its regulator. The company's mission is straightforward and serious: make healthcare safer by giving the people who deliver it better tools.

That mission now extends into AI. Across the portfolio, RLDatix is shipping AI features that reshape clinical and administrative workflows rather than sit alongside them. And because the cost of getting AI wrong in healthcare isn't a bad user experience but patient harm, quality, trust, and safety couldn't be afterthoughts. They had to be infrastructure.

THE BUILDUP

When patient lives are at stake, AI can't afford to be wrong

RLDatix's AI ambitions span the portfolio — multiple products, multiple geographies, multiple teams shipping in parallel. The two flagship initiatives below illustrate why every one of them lives or dies on rigorous evaluation.

In the UK, Ask Optima is an embedded assistant inside Optima, RLDatix's workforce management platform used to roster staff at NHS hospitals. Administrators can ask how to perform tasks inside the product, and the agent answers from ingested user guides. The goal, as senior software engineer Steve Hurcombe puts it, is to shift support left — catch the question before it becomes a ticket.

The consequences of getting it wrong, though, cut in two directions — and both matter.

  • Reputational. Ask Optima is deployed across the NHS. A manipulated or hallucinated response doesn't stay inside the product — it becomes a screenshot, a headline, a question in front of a Trust's board. "You don't want any embarrassing stuff coming out," Steve says.
  • Operational. Ask Optima answers questions about how to roster staff across 24/7 wards. A wrong answer about configuring a shift pattern can propagate into how a ward is staffed — and rosters drive who shows up to care for which patients. Quality and safety aren't separate properties here. They're the same property.

In North America, software architect Antonio Drusin's team built an AI-assisted intake flow on top of RLDatix's risk and safety product. Nurses speak or type a free-form narrative of an adverse event, and the AI extracts the structured fields the form requires. The risk profile is lower than Ask Optima's, because the nurse reviews the extracted fields before submission, but the same evaluation discipline applies.
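To make the intake evaluation concrete, here is a minimal sketch of how extracted form fields could be scored against a hand-labelled reference. It is an illustration only, not RLDatix's or Confident AI's implementation: the function names, the field names, and the sample values are all hypothetical.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are not counted as errors."""
    return " ".join(str(value).lower().split())


def score_extraction(expected, extracted):
    """Compare extracted form fields against a hand-labelled reference.

    Returns overall field accuracy plus the names of missing and wrong fields.
    """
    missing = [k for k in expected if k not in extracted or extracted[k] in (None, "")]
    wrong = [
        k for k in expected
        if k not in missing and normalize(extracted[k]) != normalize(expected[k])
    ]
    correct = len(expected) - len(missing) - len(wrong)
    accuracy = correct / len(expected) if expected else 1.0
    return {"accuracy": accuracy, "missing": missing, "wrong": wrong}


# Hypothetical labelled example; field names and values are illustrative only.
expected = {"event_type": "Medication error", "location": "Ward 4", "severity": "Low"}
extracted = {"event_type": "medication error", "location": "Ward 7"}

report = score_extraction(expected, extracted)
print(report)
```

Reporting which fields were missing or wrong, rather than a single score, gives a reviewer something specific to act on.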

These are two of the AI products RLDatix is shipping, but they aren't the only two. Every one of them was being held back by the same gap: testing that was manual, slow, and expensive. That gap is what forced the conversation at the platform level.

THE PROBLEM

A patchwork couldn't support the standard RLDatix had set for itself

Before standardizing on Confident AI, each team shipping an AI feature was effectively reinventing the evaluation process from scratch. Antonio's team in North America had an agent written in Go, evaluated by hand against a README, with results pasted into an Excel sheet. Steve's team in the UK had monitoring telemetry but no automated way to find the signal in the noise — reviewing real responses meant sampling manually, which doesn't scale past a few dozen interactions.

For a company whose software touches patient care, the gap between the standard RLDatix wanted to hold and the tools available to enforce it was untenable. Three problems compounded as the AI roadmap expanded:

  • No adversarial coverage. As Steve puts it, most teams building AI features focus on the happy path — get the right questions in, get the right answers out. That's not a bar RLDatix could ship against. With software deployed across NHS Trusts and major North American health systems, adversarial testing had to be part of the evaluation discipline from day one.
  • No standardized quality bar across the enterprise. Every team was defining its own evaluation process, its own metrics, its own definition of "good enough to ship." There was no shared baseline of non-negotiable checks every AI product had to clear — and without one, "safe to ship" meant whatever the team decided that week.
  • No portfolio-wide view of AI health. RLDatix had multiple AI products in flight across the UK and North America, but no consistent way to see how any of them were behaving in production. Failure patterns, drift, cost, adversarial exposure — all siloed inside the team that owned each product. The CTO's office had no portfolio view.
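One way to picture the missing shared baseline is as a single set of minimum scores that every product is checked against. The sketch below is illustrative only: the metric names, thresholds, product names, and scores are hypothetical and are not RLDatix's actual checks or results.

```python
# Hypothetical enterprise-wide minimums; metric names and thresholds are illustrative.
BASELINE = {"answer_relevancy": 0.80, "faithfulness": 0.90, "red_team_pass_rate": 0.95}


def check_baseline(product, scores, baseline=BASELINE):
    """Return the baseline checks a product fails; an empty list means it clears the bar."""
    failures = []
    for metric, minimum in baseline.items():
        score = scores.get(metric)
        if score is None:
            # A metric that was never measured counts as a failure, not a pass.
            failures.append(f"{product}: {metric} not reported")
        elif score < minimum:
            failures.append(f"{product}: {metric} {score:.2f} < {minimum:.2f}")
    return failures


# Made-up scores for two hypothetical products.
reported = {
    "product-a": {"answer_relevancy": 0.91, "faithfulness": 0.94, "red_team_pass_rate": 0.97},
    "product-b": {"answer_relevancy": 0.88, "faithfulness": 0.85},
}

for product, scores in reported.items():
    failures = check_baseline(product, scores)
    print(product, "clears the baseline" if not failures else failures)
```

Treating an unreported metric as a failure is the design choice that matters here: a team cannot clear the bar simply by skipping a check.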

"We hit a point where every AI team was building their own eval stack. That's fine for one product. With five, ten, fifteen AI initiatives across the portfolio, it's never going to live up to our high standards of AI governance."

Richard Jarvis, Chief Technology Officer

THE SOLUTION

One platform to unify AI quality, trust, and safety across the enterprise

RLDatix evaluated the landscape and chose Confident AI as the standard evaluation, observability, and red teaming platform across its AI portfolio. The decision was deliberately a platform decision — not a tool any one product team picked for itself, but the foundation every team building AI features would build on.

Three capabilities made the difference.

  • Red teaming and observability built in, not bolted on. Confident AI makes adversarial testing a default part of the eval cycle and sits on top of live telemetry to flag low-scoring responses the moment they happen. "Having a really solid red team penetration test gives me a lot of good confidence that it's safe," Steve says. Issues get caught as they appear in production, not weeks later in a manual review.

  • Open-source, auditable evaluation. Confident AI is built on DeepEval, the most widely adopted open-source LLM evaluation framework. In healthcare, the way an AI is scored has to be inspectable, defensible, and reproducible — not a black box behind a vendor's API. When a regulator or a Trust asks how an AI feature was evaluated, RLDatix can show the work, line by line.

  • Every AI use case covered. Every persona empowered. Engineers get a clean SDK that drops into CI. Product managers and clinical domain experts run full evaluation cycles in the same platform — and annotate live responses to improve the agent over time. "We can assign evaluations to experts directly," Antonio says. A platform only engineers can use isn't a platform. It's a tool.
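As a rough illustration of flagging low-scoring production responses for expert review, the sketch below filters scored telemetry records against a threshold. It is not Confident AI's API: the record shape, the `flag_low_scores` helper, the threshold, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    trace_id: str
    metric: str
    score: float  # assumed to lie between 0.0 and 1.0


def flag_low_scores(responses, threshold=0.5):
    """Return responses scoring below the threshold, worst first, for human review."""
    flagged = [r for r in responses if r.score < threshold]
    return sorted(flagged, key=lambda r: r.score)


# Made-up telemetry records for illustration.
telemetry = [
    ScoredResponse("trace-001", "answer_relevancy", 0.92),
    ScoredResponse("trace-002", "answer_relevancy", 0.41),
    ScoredResponse("trace-003", "faithfulness", 0.18),
]

for response in flag_low_scores(telemetry):
    print(f"review {response.trace_id}: {response.metric}={response.score:.2f}")
```

Sorting worst first means a domain expert with limited time sees the most concerning responses before anything else.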

But capability is only half the story for a platform team. The other half is whether Confident AI can actually be deployed inside the enterprise. Confident AI is self-hosted on RLDatix's own AWS infrastructure, across multiple regions to satisfy the data residency and regulatory requirements that come with operating in healthcare across two continents. The first region was up in a week.

For RLDatix's platform team, the experience felt close to self-serve: spin up an environment, point engineers at it, replicate the deployment in the next region without re-running a months-long vendor engagement. Confident AI behaves like part of RLDatix's stack, not a dependency on someone else's.

"When I joined RLDatix, the team hadn’t got a strategic, platform-centric solution for AI governance. We saw Confident AI’s thought leadership and agreed: we're going with Confident AI. For healthcare in particular, you need evals that are inspectable, defensible, not a black box, but something you can actually trust. Coupled with their open-source [DeepEval], nothing else came close."

Richard Jarvis, Chief Technology Officer

THE IMPACT

Trust, not throughput, is the metric that matters in healthcare AI

Confident AI fundamentally changed how RLDatix scales AI initiatives across the enterprise.

Instead of building an internal AI quality platform from scratch, RLDatix's platform team standardized on Confident AI as centralized infrastructure. New AI products no longer start from zero; they inherit an existing quality layer from day one.

For Richard, the productivity gains were never the headline. "It's all around trust in healthcare," he says. "It's not about whether I could remove people from an organization, ship code faster, or any of that. It's about will this tool be high quality enough to put into the healthcare system and be trusted."

Before Confident AI, every team had to assemble its own evaluation workflows, monitoring processes, governance practices, and quality assurance tooling. Each team also had to define its own metrics and its own definition of what "safe to ship" meant.

As the number of AI products expanded across the portfolio, that duplicated operational overhead would have compounded team by team. That model wasn't going to scale across a healthcare enterprise shipping AI across multiple regions, products, and regulatory environments.

The operational change was immediate. Manual evaluations became automated and continuously running. Adversarial testing and production monitoring shifted from periodic manual review to always-on. And product managers and clinical domain experts could run evaluations themselves, instead of queuing behind engineering.

"Our mission is to raise the standard of care… everywhere. That’s a bold ambition and requires a partner that can scale with us and holds the same standard of quality in the AI that is deployed. We’re happy we’ve found them."

Richard Jarvis, Chief Technology Officer

For RLDatix, AI evaluation, observability, and red teaming are no longer isolated processes managed team-by-team. They've become shared infrastructure every AI initiative builds on top of by default.