CASE STUDY

How Finom used Confident AI to cut agent improvement cycles from 10 days to 3 hours

  • 27x faster iteration cycles
  • €500K+ engineering costs saved
  • 60+ hours saved weekly
  • 10+ AI use cases evaluated

"Before Confident AI, a single improvement cycle took 10 days—I'd create a task, assign it to an engineer, wait for availability, and go back and forth. Now the same cycle takes three hours, and our product managers can run it themselves."

Igor Kolodkin, Lead Data Scientist at Finom

THE COMPANY

Finom is building the financial home for Europe's entrepreneurs

Finom is a European fintech platform that combines business banking, invoicing, expense management, and AI-powered accounting into a single mobile-first product. Founded in 2019 and headquartered in Amsterdam, the company serves over 125,000 SMBs and freelancers across Germany, France, Italy, Spain, and the Netherlands—backed by more than €300M in total funding from investors including General Catalyst, AVP, and Northzone.

But Finom isn't just digitizing financial workflows—it's rethinking them entirely. The company is building AI agents that don't simply guide users through processes but execute tasks on their behalf: issuing cards, setting limits, managing accounting. As Chief AI Scientist Ivo Dimitrov puts it, "Our goal isn't just to improve existing processes—it's to rethink the whole thing. Making AI not just a copilot, but a decision-maker inside the process."

Finom's most ambitious bet is a new agentic system that operates both reactively—responding when a customer asks—and proactively, pushing relevant actions before the customer even thinks to request them. With sub-agents mapped to every domain from cards to invoicing, each connected to dedicated MCP servers and backend microservices, the stakes for getting AI right are exceptionally high.

THE BUILDUP

AI agents that handle real money need more than vibe checks

Building AI-powered financial products isn't the same as building a chatbot. When an agent can issue a credit card, adjust spending limits, or modify accounting records, the cost of getting it wrong isn't a bad user experience—it's financial harm.

Finom's AI team understood this pressure acutely. They were scaling across two parallel tracks: internal agents automating operational processes like risk engines and HR workflows, and user-facing products that would fundamentally change how customers interact with their finances. Both required deep observability into how agents behaved in practice—and a level of confidence in that behavior that casual testing couldn't provide.

As Igor explains, "As agents become more complex, it's hard to understand what they actually can do and what they can't. Observability helps us developers understand what the agent actually does, how it works internally, what latencies it has, what problems it has. We see where it fails—and we can fix it."

The team knew that evaluation had to come first—not as an afterthought, but as the foundation for every agent they shipped. The question was whether their existing tooling could keep pace.

THE PROBLEM

The engineer bottleneck was slowing everything down

Before Confident AI, Finom's evaluation workflow had a single point of failure: the engineer. Every change to a prompt, every new test case, every dataset update required a task to be created and assigned to a developer—even though most of this work was product work, not engineering work.

The team faced four compounding challenges:

  • Engineers owned the entire eval loop. Only developers could see what test cases existed, change prompts, update datasets, or add new metrics. Product managers with deep knowledge of customer intent and business context were locked out of the process entirely.

  • No unified observability across agents. With multiple AI products in development, the team had no consistent way to trace agent behavior, monitor latencies, or identify failure patterns across systems. Debugging meant digging through logs manually—if the logs existed at all.

  • Product people couldn't contribute where they mattered most. As Igor explains: "Creating good datasets that represent your users' intents—it's really hard work, and engineers don't know this part well. Only product can make a good estimation of what users would ask our agents to do."

  • Multi-turn testing was primitive. The team had multi-turn evaluations, but they were entirely predefined—scripted messages sent regardless of how the agent responded. This produced results that bore little resemblance to real conversations.

The result was a painful improvement cycle. A single iteration—from identifying an issue to testing a fix—took roughly ten days as tasks passed between product, engineering, and QA. Meanwhile, Finom was preparing to launch agents that could read and write real financial data on behalf of customers. The pace of testing couldn't match the ambition of the product.

"Without a platform and system around it, each AI product would need a dedicated engineer just to support the improvement cycle. With five to ten AI products inside Finom, that's roughly €500K in engineering cost."

Igor Kolodkin, Lead Data Scientist

THE SOLUTION

A platform to unify engineering and product evaluation workflows

Finom evaluated several alternatives—including LangSmith and MLflow—before choosing Confident AI. Each fell short in different ways. MLflow was too technical, built for engineers but unusable for product managers. LangSmith had tighter integration with their LangChain stack but didn't solve the collaboration problem.

"We found several features in Confident AI that caught our attention—AI Connection and multi-turn simulation. These features were not present in any other competitor, and they were important for us," says Igor.

AI Connection allowed Finom to connect their live agent system to Confident AI with a single new endpoint—no need to recreate the agent somewhere else just to test it. This meant product managers were no longer evaluating isolated prompts. They were testing the entire system: tools, MCP servers, sub-agents, and all. "This is a game changer," says Igor. "Now when managers test the agent inside Confident AI, they're not testing one small part—they're testing the whole system."
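To make the "single new endpoint" idea concrete, here is a minimal sketch of what an agent evaluation endpoint can look like, assuming a simple JSON-over-HTTP contract ({"input": ...} in, {"output": ...} out). The actual contract Confident AI expects may differ, and `run_agent` is a stand-in for Finom's real agent stack:

```python
# Minimal sketch of an evaluation endpoint exposing a live agent.
# Assumes a hypothetical JSON contract; run_agent is an illustrative
# stand-in for the real system (tools, sub-agents, MCP servers).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_agent(user_message: str) -> str:
    """Stand-in for the full agent pipeline."""
    return f"Agent response to: {user_message}"


class EvalEndpoint(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the evaluation request body and pass it to the agent.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        reply = run_agent(payload.get("input", ""))

        # Return the agent's answer as JSON.
        body = json.dumps({"output": reply}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# To expose the agent for evaluation:
#   HTTPServer(("127.0.0.1", 8080), EvalEndpoint).serve_forever()
```

Because the endpoint wraps the production agent rather than a copy, anything the agent does behind that call (tool use, sub-agent routing, backend lookups) is exercised by every evaluation.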

Multi-turn simulation replaced Finom's rigid, scripted test conversations with dynamic simulations that adapted based on actual agent responses—closer to how real users interact with the product.
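The difference between the two approaches can be shown with a toy example. This is an illustration of the concept only, not Confident AI's implementation; the stand-in agent and its clarifying-question rule are made up:

```python
# Toy contrast between scripted and adaptive multi-turn testing.
# agent_reply is a made-up stand-in that asks a clarifying question
# when a card request arrives without a spending limit.
def agent_reply(message: str) -> str:
    if "card" in message and "limit" not in message:
        return "What spending limit should the card have?"
    return "Done."


def scripted_conversation(agent, script):
    """Sends predefined messages regardless of what the agent says."""
    return [agent(msg) for msg in script]


def simulated_conversation(agent, opening, max_turns=4):
    """Chooses the next user turn based on the agent's last response."""
    transcript, user_msg = [], opening
    for _ in range(max_turns):
        reply = agent(user_msg)
        transcript.append((user_msg, reply))
        if reply == "Done.":
            break
        # Adapt: answer the clarifying question instead of blindly
        # pushing the next scripted line.
        user_msg = "A 500 EUR monthly limit, please."
    return transcript
```

A scripted run sends its next line whether or not the agent asked for a limit; the simulated run answers the clarifying question, which is closer to how a real customer behaves.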

Beyond these, the platform gave both sides of the team a shared workspace. Engineers got DeepEval—a clean SDK that integrates with pytest and fits naturally into existing development workflows. Product managers got a UI where they could design evaluations, manage datasets, and review results without filing engineering tickets.

"Confident AI is a good product for engineers and also for non-engineers. Under the hood you have DeepEval—engineers love it. And on the other hand, you have a pretty good UI for non-engineers. It's two things in one."

Igor Kolodkin, Lead Data Scientist

THE IMPACT

From 10-day cycles to 3-hour iterations—and product teams finally in the driver's seat

The most dramatic change was speed. Before Confident AI, a single improvement cycle—from identifying an issue to validating a fix—took approximately ten days as tasks moved between product managers and engineers. Now, the same cycle takes as little as three hours.

"Before, I create a task, I put it on the engineer, I wait for when he'll be ready to take it, then he makes a first version, I don't like it—it's 10 days," says Igor. "Now? Three hours."

This wasn't just about faster tooling—it was about removing the structural bottleneck that kept product teams dependent on engineering. Product managers can now run evaluations directly, using either the Confident AI platform or the DeepEval CLI—no engineering ticket required.

"Without Confident AI, each AI product would need a dedicated engineer full-time. Now, a manager can run evals on their own. That's the difference."

Igor Kolodkin, Lead Data Scientist

Critically, those iterations run against the real agent—not a replica. Before Confident AI, testing a complex agentic system meant either painfully recreating it in a separate environment or settling for evaluating isolated prompts. With AI Connection, Finom added a single endpoint to their existing codebase, and every evaluation now runs end to end—tools, sub-agents, MCP servers, and all.

"We're not just testing one small part," says Igor. "We're testing the whole system. That's what you need when you have agents with tools, with MCPs, with sub-agents and everything."

That's the real shift: an end-to-end iteration cycle where product managers identify an issue, adjust the evaluation, test the full agent, and validate the fix—all within hours, all without recreating anything.

The financial impact is equally clear. Without Confident AI, Finom estimates it would need one dedicated engineer per AI product just to maintain the evaluation workflow. Across five to ten AI products, that translates to roughly €500K in annual engineering costs—resources now freed to focus on building the agents themselves.

For a company preparing to launch AI agents that can execute real financial transactions on behalf of customers, the ability to evaluate and observe those agents rigorously—across every turn, every tool call, every latency spike—isn't a nice-to-have. It's what makes shipping possible.
