CASE STUDY

How Finom used Confident AI to cut agent improvement cycles from 10 days to 3 hours

  • 27x faster iteration cycles
  • €500K+ engineering costs saved
  • 60+ hours saved weekly
  • 10+ AI use cases evaluated

"Before Confident AI, a single improvement cycle took 10 days—I'd create a task, assign it to an engineer, wait for availability, and go back and forth. Now the same cycle takes three hours, and our product managers can run it themselves."

Igor Kolodkin, Lead Data Scientist at Finom

THE COMPANY

Finom is building the financial home for Europe's entrepreneurs

Finom is a European fintech platform that combines business banking, invoicing, expense management, and AI-powered accounting into a single mobile-first product. Founded in 2019 and headquartered in Amsterdam, the company serves over 125,000 SMBs and freelancers across Germany, France, Italy, Spain, and the Netherlands—backed by more than €300M in total funding from investors including General Catalyst, AVP, and Northzone.

But Finom isn't just digitizing financial workflows—it's rethinking them entirely. The company is building AI agents that don't simply guide users through processes but execute tasks on their behalf: issuing cards, setting limits, managing accounting. As Chief AI Scientist Ivo Dimitrov puts it, "Our goal isn't just to improve existing processes—it's to rethink the whole thing. Making AI not just a copilot, but a decision-maker inside the process."

Finom's most ambitious bet is a new agentic system that operates both reactively—responding when a customer asks—and proactively, pushing relevant actions before the customer even thinks to request them. With sub-agents mapped to every domain from cards to invoicing, each connected to dedicated MCP servers and backend microservices, the stakes for getting AI right are exceptionally high.

THE BUILDUP

AI agents that handle real money need more than vibe checks

Building AI-powered financial products isn't the same as building a chatbot. When an agent can issue a credit card, adjust spending limits, or modify accounting records, the cost of getting it wrong isn't a bad user experience—it's financial harm.

Finom's AI team understood this pressure acutely. They were scaling across two parallel tracks: internal agents automating operational processes like risk engines and HR workflows, and user-facing products that would fundamentally change how customers interact with their finances. Both required deep observability into how agents behaved in practice—and a level of confidence in that behavior that casual testing couldn't provide.

As Igor explains, "As agents become more complex, it's hard to understand what they actually can do and what they can't. Observability helps us developers understand what the agent actually does, how it works internally, what latencies it has, what problems it has. We see where it fails—and we can fix it."

The team knew that evaluation had to come first—not as an afterthought, but as the foundation for every agent they shipped. The question was whether their existing tooling could keep pace.

THE PROBLEM

The engineer bottleneck was slowing everything down

Before Confident AI, Finom's evaluation workflow had a single point of failure: the engineer. Every change to a prompt, every new test case, every dataset update required a task to be created and assigned to a developer—even though most of this work was product work, not engineering work.

The team faced four compounding challenges:

  • Engineers owned the entire eval loop. Only developers could see what test cases existed, change prompts, update datasets, or add new metrics. Product managers with deep knowledge of customer intent and business context were locked out of the process entirely.

  • No unified observability across agents. With multiple AI products in development, the team had no consistent way to trace agent behavior, monitor latencies, or identify failure patterns across systems. Debugging meant digging through logs manually—if the logs existed at all.

  • Product people couldn't contribute where they mattered most. As Igor explains: "Creating good datasets that represent your users' intents—it's really hard work, and engineers don't know this part well. Only product can make a good estimation of what users would ask our agents to do."

  • Multi-turn testing was primitive. The team had multi-turn evaluations, but they were entirely predefined—scripted messages sent regardless of how the agent responded. This produced results that bore little resemblance to real conversations.

The result was a painful improvement cycle. A single iteration—from identifying an issue to testing a fix—took roughly ten days as tasks passed between product, engineering, and QA. Meanwhile, Finom was preparing to launch agents that could read and write real financial data on behalf of customers. The pace of testing couldn't match the ambition of the product.

"Without a platform and system around it, each AI product would need a dedicated engineer just to support the improvement cycle. With five to ten AI products inside Finom, that's roughly €500K in engineering cost."

Igor Kolodkin, Lead Data Scientist

THE SOLUTION

A platform to unify engineering and product evaluation workflows

Finom evaluated several alternatives—including LangSmith and MLflow—before choosing Confident AI. Each fell short in different ways. MLflow was too technical, built for engineers but unusable for product managers. LangSmith had tighter integration with their LangChain stack but didn't solve the collaboration problem.

"We found several features in Confident AI that caught our attention—AI Connection and multi-turn simulation. These features were not present in any other competitor, and they were important for us," says Igor.

AI Connection allowed Finom to connect their live agent system to Confident AI with a single new endpoint—no need to recreate the agent somewhere else just to test it. This meant product managers were no longer evaluating isolated prompts. They were testing the entire system: tools, MCP servers, sub-agents, and all. "This is a game changer," says Igor. "Now when managers test the agent inside Confident AI, they're not testing one small part—they're testing the whole system."
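To make the "single new endpoint" idea concrete, here is a minimal sketch of what an agent evaluation endpoint can look like, assuming a simple JSON-over-HTTP contract ({"input": ...} in, {"output": ...} out). The actual contract Confident AI expects may differ, and `run_agent` is a stand-in for Finom's real agent stack:

```python
# Minimal sketch of an evaluation endpoint exposing a live agent.
# Assumes a hypothetical JSON contract; run_agent is an illustrative
# stand-in for the real system (tools, sub-agents, MCP servers).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_agent(user_message: str) -> str:
    """Stand-in for the full agent pipeline."""
    return f"Agent response to: {user_message}"


class EvalEndpoint(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the evaluation request body and pass it to the agent.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        reply = run_agent(payload.get("input", ""))

        # Return the agent's answer as JSON.
        body = json.dumps({"output": reply}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# To expose the agent for evaluation:
#   HTTPServer(("127.0.0.1", 8080), EvalEndpoint).serve_forever()
```

Because the endpoint wraps the production agent rather than a copy, anything the agent does behind that call (tool use, sub-agent routing, backend lookups) is exercised by every evaluation.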

Multi-turn simulation replaced Finom's rigid, scripted test conversations with dynamic simulations that adapted based on actual agent responses—closer to how real users interact with the product.
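The difference between the two approaches can be shown with a toy example. This is an illustration of the concept only, not Confident AI's implementation; the stand-in agent and its clarifying-question rule are made up:

```python
# Toy contrast between scripted and adaptive multi-turn testing.
# agent_reply is a made-up stand-in that asks a clarifying question
# when a card request arrives without a spending limit.
def agent_reply(message: str) -> str:
    if "card" in message and "limit" not in message:
        return "What spending limit should the card have?"
    return "Done."


def scripted_conversation(agent, script):
    """Sends predefined messages regardless of what the agent says."""
    return [agent(msg) for msg in script]


def simulated_conversation(agent, opening, max_turns=4):
    """Chooses the next user turn based on the agent's last response."""
    transcript, user_msg = [], opening
    for _ in range(max_turns):
        reply = agent(user_msg)
        transcript.append((user_msg, reply))
        if reply == "Done.":
            break
        # Adapt: answer the clarifying question instead of blindly
        # pushing the next scripted line.
        user_msg = "A 500 EUR monthly limit, please."
    return transcript
```

A scripted run sends its next line whether or not the agent asked for a limit; the simulated run answers the clarifying question, which is closer to how a real customer behaves.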

Beyond these, the platform gave both sides of the team a shared workspace. Engineers got DeepEval—a clean SDK that integrates with pytest and fits naturally into existing development workflows. Product managers got a UI where they could design evaluations, manage datasets, and review results without filing engineering tickets.

"Confident AI is a good product for engineers and also for non-engineers. Under the hood you have DeepEval—engineers love it. And on the other hand, you have a pretty good UI for non-engineers. It's two things in one."

Igor Kolodkin, Lead Data Scientist

THE IMPACT

From 10-day cycles to 3-hour iterations—and product teams finally in the driver's seat

The most dramatic change was speed. Before Confident AI, a single improvement cycle—from identifying an issue to validating a fix—took approximately ten days as tasks moved between product managers and engineers. Now, the same cycle takes as little as three hours.

"Before, I create a task, I put it on the engineer, I wait for when he'll be ready to take it, then he makes a first version, I don't like it—it's 10 days," says Igor. "Now? Three hours."

This wasn't just about faster tooling—it was about removing the structural bottleneck that kept product teams dependent on engineering. Product managers can now run evaluations directly, using either the Confident AI platform or the DeepEval CLI—no engineering ticket required.

"Without Confident AI, each AI product would need a dedicated engineer full-time. Now, a manager can run evals on their own. That's the difference."

Igor Kolodkin, Lead Data Scientist

Critically, those iterations run against the real agent—not a replica. Before Confident AI, testing a complex agentic system meant either painfully recreating it in a separate environment or settling for evaluating isolated prompts. With AI Connection, Finom added a single endpoint to their existing codebase, and every evaluation now runs end to end—tools, sub-agents, MCP servers, and all.

"We're not just testing one small part," says Igor. "We're testing the whole system. That's what you need when you have agents with tools, with MCPs, with sub-agents and everything."

That's the real shift: an end-to-end iteration cycle where product managers identify an issue, adjust the evaluation, test the full agent, and validate the fix—all within hours, all without recreating anything.

The financial impact is equally clear. Without Confident AI, Finom estimates it would need one dedicated engineer per AI product just to maintain the evaluation workflow. Across five to ten AI products, that translates to roughly €500K in annual engineering costs—resources now freed to focus on building the agents themselves.

For a company preparing to launch AI agents that can execute real financial transactions on behalf of customers, the ability to evaluate and observe those agents rigorously—across every turn, every tool call, every latency spike—isn't a nice-to-have. It's what makes shipping possible.
