The platform and the people you need to build Vertical AI

January 11, 2026
Why do most AI pilots in finance fail while a few actually scale? This post uncovers the hidden infrastructure and domain expertise needed to move from impressive demos to reliable, production-ready AI, and explains why finance demands far more than “mostly correct” models or better prompts.

Why AI in finance requires specialized infrastructure and an interdisciplinary team

In my experience operating at the intersection of engineering and finance, I’ve seen a striking and consistent pattern in how organizations approach Generative AI. While adoption is everywhere, the journey from a promising pilot to full-scale production remains a treacherous path for most. A recent study by MIT researchers throws this into stark relief: a crushing 95% of internal corporate GenAI pilots crash and burn before hitting production. Contrast this with the initiatives led by specialized vertical AI partners, who boast a more robust success rate of approximately 67%.

This massive disparity isn’t usually about a lack of talent or budget. Financial institutions hire some of the sharpest engineers on the planet and pay them extremely well. Instead, the evidence points to a structural problem.

This article is my hypothesis for why that gap exists and, crucially, how to span it. It’s informed by my time building the first generation of enterprise AI solutions at Google, then helping multiple companies in a Private Equity firm’s portfolio adopt AI, and now as the co-founder of obin.ai, where we’re building cutting-edge AI agents for global asset managers.

And yes, it’s also a bit of a mea culpa. When Hannes and I wrote the book Generative AI Design Patterns, we cataloged common, repeatable solutions for the thorny problems that pop up when building GenAI applications and agents. We wrote that book primarily for engineers, but I’ve since had a significant epiphany: to successfully wield the technical solutions in that book — whether controlling an agent’s tone or implementing robust guardrails — you also need deep, intrinsic domain expertise.

Why pilots stall: it’s not a technology problem

The most challenging phase of an AI project is the move from a successful prototype to something robust enough and reliable enough to deploy into production. This, however, is not an MLOps problem, and so the solution is not technological. You cannot buy an agentic platform that will enable you to productionize your pilots. Sorry.

It is relatively straightforward to build a prototype that achieves 80% accuracy on standard tasks, such as summarizing a clear earnings call or extracting data from a standard invoice. In many software contexts, this is a successful beta. You can create a product backlog and methodically address the remaining bugs and missing features. In financial workflows, however, this remaining 20% gap renders the tool unusable from the get-go. You will struggle to get any adoption.

Pilots frequently stall at this “80% Wall” because of two specific characteristics of financial data: aggregation-dependent use cases and the long tail.

Image generated by author using Gemini (nanobanana)


Reason 1: Errors compound because finance use cases involve aggregation

In consumer applications, an error is usually an isolated event. If a chatbot gets the answer to a question wrong, it affects just that one user. A bot that is 90% correct is serving 90% of users well. Moreover, you can usually detect what types of questions the bot is getting wrong (using implicit and explicit human feedback) and improve the product over time to get higher and higher completion rates.

Financial workflows, however, often rely on aggregation. Consider a credit risk workflow that requires extracting data from 50 different loan documents to calculate a single covenant ratio.

Because a single task requires 50 intermediate answers to all be correct, the math of probability works against a model that is merely “good.” If an agent is 97% accurate at extracting individual line items, only about one-fifth of covenant ratios will be correct ($0.97^{50} \approx 0.22$). If roughly 80% of AI-generated answers are wrong (and it’s unclear which 20% are correct), users have to validate every single answer. Why would they use the system in the first place? This lack of adoption cuts off the avenue (human feedback) that consumer applications leverage to get better over time.
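
To make the compounding concrete, here is a tiny illustrative calculation; the per-item accuracies are hypothetical inputs, and the 50-document chain is the example from above:

```python
# Illustrative only: end-to-end accuracy of an aggregated workflow in which
# every one of n intermediate extractions must be correct, assuming errors
# are independent across documents.

def chain_accuracy(per_item_accuracy: float, n_items: int) -> float:
    """Probability that all n independent extractions are correct."""
    return per_item_accuracy ** n_items

for acc in (0.90, 0.97, 0.99, 0.999):
    print(f"per-item accuracy {acc:.3f} -> "
          f"chance the 50-document covenant ratio is right: {chain_accuracy(acc, 50):.1%}")

# per-item accuracy 0.900 -> chance the 50-document covenant ratio is right: 0.5%
# per-item accuracy 0.970 -> chance the 50-document covenant ratio is right: 21.8%
# per-item accuracy 0.990 -> chance the 50-document covenant ratio is right: 60.5%
# per-item accuracy 0.999 -> chance the 50-document covenant ratio is right: 95.1%
```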

For an agent to be valuable in this context, it cannot just be “mostly right.” It requires an architecture designed for near-perfect extraction accuracy across the entire chain, or the final output cannot be trusted for decision-making. Since perfect accuracy is impossible (see the next section about the long tail), you need to also track errors and present a confidence interval for that covenant ratio. You might also want to allow users to audit results and correct the documents that contribute the most to that error. This requires a combination of domain expertise, mathematical sophistication, and software engineering skill.
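
As a sketch of what that tracking might look like (not a description of any production system; the document names and confidence values are made up), you can propagate per-document extraction confidence into an overall confidence for the ratio and rank the documents a reviewer should audit first:

```python
# Sketch: propagate per-document extraction confidence into an overall
# confidence for the aggregated covenant ratio, and rank documents for audit.
# Document names and confidence values are hypothetical.

from math import prod

doc_confidence = {          # confidence that each document's extraction is correct
    "loan_007": 0.99,
    "loan_013": 0.82,
    "loan_031": 0.97,
    "loan_045": 0.91,
}

# assuming independence, the chance that every input to the ratio is correct
ratio_confidence = prod(doc_confidence.values())
print(f"covenant ratio confidence: {ratio_confidence:.0%}")   # ~72%

# audit the documents that contribute the most uncertainty first
audit_order = sorted(doc_confidence, key=doc_confidence.get)
print("audit first:", audit_order[:2])                        # loan_013, loan_045
```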


Reason 2: The long tail of edge cases leads to loss of trust

The second hurdle is that the happy path doesn’t carry much value. Unlike a call center, where 80% of calls might be easy questions such as store hours, an enterprise workflow often exists precisely to deal with exceptions: the easy loan applications are already handled by rules engines and ML models, so it’s the nuanced cases that end up in the workflow.

It is common for engineering teams to test models using the standard data available to them — clean ISDA agreements or standard 10-Ks. As discussed, however, the value in finance often resides in the long tail. For example, a distressed credit clause that contradicts standard terms, or a corporate action involving a simultaneous spin-off and merger, or a loan notice that presents a partial PIK payment, or a cross-border transaction subject to conflicting regulatory jurisdictions.

Generalist engineering teams often treat these as “bugs” to be fixed later. However, in finance, these are not bugs; they are the domain. They are its raison d’être: this is why you have private credit, M&A efficiencies, the flexibility of PIK, or cross-border commerce. You cannot simply prompt your way out of the long tail. It requires architectural intervention with a keen understanding of why these edge cases exist. However, because any single one of these edge cases appears infrequently, AI systems often pass the initial validation of a pilot, only to fail when deployed against the complexity of live markets.

Combine the long tail with the fact that many finance use cases involve aggregation, and these failures lead to a loss of trust. Adoption suffers. And hence, you are stuck in the land of successful pilots that end-users don’t use.

To successfully address these characteristics, you need the right platform and the right team. Let’s start with the platform, and then we’ll talk about the team.


The platform you need: evaluators and sandboxes

To understand the platform you need in order to be successful in finance, it is helpful to look at the domain where agents are currently most successful: software engineering.

Tools like Claude Code have rapidly transformed software development, but not simply because the underlying models (like Claude Sonnet or Claude Opus) are very good at Python or JavaScript. They succeed because the software engineering ecosystem provides two things that the financial ecosystem lacks: Deterministic Validation and Context Management.


When an agent like Claude Code spits out a few lines of Python, it’s not operating in a vacuum — it’s locked into a high-speed, deterministic feedback loop. Instantly, it runs a compiler in a sandbox to find syntax errors, a linter to enforce style, and a test suite to verify the logic. If it fails, the environment screams an explicit error: “Extra indentation on line 10!” or “Unreachable code in line 10” or “Test 10 expected A, but got B!” The agent receives this crystal-clear, descriptive feedback, self-corrects, and tries again. Ironclad, speedy validation is what makes the agentic loop fly.
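
The loop itself is simple to sketch. The pseudocode below is only meant to show the shape of that feedback cycle, not how Claude Code or any other tool is actually implemented:

```python
# Shape of the deterministic feedback loop behind coding agents: generate,
# grade with compiler/linter/tests, feed the explicit errors back, retry.
# Illustrative pseudocode; generate_patch and run_checks are stand-ins.

def agentic_loop(task: str, generate_patch, run_checks, max_attempts: int = 5):
    feedback = ""
    for _ in range(max_attempts):
        patch = generate_patch(task, feedback)   # LLM call
        ok, errors = run_checks(patch)           # compiler, linter, test suite
        if ok:
            return patch                         # deterministically validated
        feedback = errors                        # crystal-clear, descriptive feedback
    raise RuntimeError("no patch passed validation within the attempt budget")
```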

The second missing ingredient is context management; finance has none. In software, tools like git were designed explicitly to support software engineers working in parallel on the same code base. They leverage worktrees, pull requests, and integrated test suites to ensure that changes are safe. All the necessary code and deployment configurations are available in a single repository or cloud environment. AI agents can plug into a context that is rich, integrated, and still safe to experiment within.

Finance doesn’t have git. It has … Excel.

Finance lacks native tools for verification, validation, and experimentation. There is no terminal utility that flashes red when an agent misinterprets a SWIFT message, misses a zero in a spreadsheet, or conflates “authorized capital” with “issued capital.” There is no shared context, because the workflow spans multiple siloed systems (sometimes for valid reasons, such as regulatory compliance, or because they belong to different organizations operating in a market). And changes are not necessarily safe.

Because this feedback mechanism is missing, the single biggest hurdle to production is not building the agent — instead, you need to build a platform that can run the agent in isolation and grade the agent’s work. Let’s start with grading the agent first, then we’ll look at how to run it in a secure sandbox for that grading.

Image generated by author using Gemini (nanobanana)

Grading agents in a domain-specific way

Lacking deterministic tools, many engineering teams rely on a pattern known as “LLM-as-a-Judge.” This involves asking a second, larger model (like GPT-4) to evaluate the output of the first model.

In my book, Generative AI Design Patterns, I discuss the utility of this approach for creative tasks. If you are summarizing news sentiment, an LLM judge is sufficient.

However, in financial workflows, this is effectively nothing more than a vibes check. It assesses fluency, not factual accuracy. An LLM judge might confirm that a risk memo reads professionally, but it cannot reliably verify if the EBITDA add-backs adhere to the specific definitions in the credit agreement. Relying on a probabilistic model to grade another probabilistic model introduces a circular dependency that auditors and end-users will not accept.

To move beyond the vibes check, an engineering team must build what I call a “Financial Compiler”, a deterministic evaluation engine that validates agent outputs against ground truth.

This is not AI, although it can involve machine learning. It is usually rigid, rules-based code designed to constrain the AI. It involves integrity checks (such as ensuring that $Assets = Liabilities + Equity$ at every step), cross-referencing (such as verifying that the maturity date on the trade ticket matches the original loan documentation), and sanity checks (such as hard-coded limits, for example on LTV ratios, that flag any output falling outside realistic bounds).
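
A few lines of code make the flavor concrete. This is a deliberately tiny sketch of such deterministic checks; the field names and the LTV bound are illustrative, not a real rule set:

```python
# A deliberately tiny sketch of a "Financial Compiler": rigid, rules-based
# checks applied to agent output. Field names and thresholds are illustrative.

def validate_extraction(out: dict) -> list[str]:
    errors = []

    # Integrity check: the balance sheet must balance at every step.
    if abs(out["assets"] - (out["liabilities"] + out["equity"])) > 0.01:
        errors.append("balance sheet does not balance: assets != liabilities + equity")

    # Cross-reference: the trade ticket must agree with the loan documentation.
    if out["ticket_maturity"] != out["loan_doc_maturity"]:
        errors.append("maturity date on trade ticket does not match loan documentation")

    # Sanity check: hard-coded realistic bounds, e.g. on the LTV ratio.
    if not (0.0 <= out["ltv"] <= 1.5):
        errors.append(f"LTV of {out['ltv']:.2f} is outside realistic bounds")

    return errors  # empty list means the output passes this grader


errors = validate_extraction({
    "assets": 120.0, "liabilities": 80.0, "equity": 40.0,
    "ticket_maturity": "2031-06-15", "loan_doc_maturity": "2031-06-15",
    "ltv": 0.62,
})
print(errors or "all checks passed")
```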

An isolated context

Grading has to happen while the agent operates, to ensure that agents act in legal and permitted ways (and do not jump to answers using information they are not authorized to know).

To grade agents while they work, you have to engineer a synthetic context layer. This is a virtualized environment (a sandbox) where the agent can aggregate these fragmented data sources and “work” on them without touching the live wire of your production systems.
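
One way to picture that layer is sketched below, under the assumption that source data can be snapshotted: the agent reads frozen copies of the siloed systems, writes to a scratch store, and nothing is promoted to production until the graders pass. The names are illustrative, not a real API.

```python
# Sketch of a synthetic context layer: the agent works on read-only snapshots
# of the siloed source systems and writes to a scratch store; nothing reaches
# production unless the deterministic graders pass. Names are illustrative.

from types import MappingProxyType

class Sandbox:
    def __init__(self, snapshots: dict[str, dict]):
        # read-only views of data pulled from each siloed system
        self.snapshots = {name: MappingProxyType(data) for name, data in snapshots.items()}
        self.scratch = {}          # agent output lands here, never in production

    def read(self, system: str, key: str):
        return self.snapshots[system][key]

    def write(self, key: str, value):
        self.scratch[key] = value

    def promote(self, graders, production_store: dict):
        errors = [e for grade in graders for e in grade(self.scratch)]
        if errors:
            return errors          # block the write, surface the errors
        production_store.update(self.scratch)
        return []
```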

The Infrastructure Debt

The “hidden cost” that sinks many enterprise GenAI pilots is that building this evaluation infrastructure is often more expensive than building the agent itself. Horizontal evaluation tools (tracers, loggers, dashboards, LLM-scored metrics such as RAGAS, and so on) can accelerate your ability to build graders, but you still have to put in the hard work of building the domain-specific graders.

I find that many teams scope for the model, but they rarely budget for the evaluator. Even if they budget for evaluators, they almost never budget for the isolated system in which to run them. Consequently, when the pilot finishes, the team has a tool that works 80% of the time, but no automated way to prove which 20% is wrong. Without that proof, Risk and Compliance committees inevitably (and correctly!) block the deployment.

The team you need: tightly coupled domain experts and engineers

When engineering teams jump into an AI pilot, they typically start with the Standard Operating Procedures (SOPs). The logic is simple, almost tragically so: “We’ve got a document telling our analysts how to do the task. We just need to feed that gospel to the model.”

I have yet to witness an SOP that is actually followed, line-by-line, in the real world. SOPs are peace treaties written by committee, not marching orders for machines. They map out the happy path, the smooth, standard procedure for a perfectly standard Tuesday. They are propped up by the invisible scaffolding of common sense and “tribal knowledge” that every employee absorbs by osmosis. Historically, when a curveball hits (a contradiction in the data, a foggy clause), the human operator resolves it through common sense, intuition, or leaning over and asking a neighbor.

But agents? They’re stone-cold literalists. They lack that divine spark of intuition. To be worth their salt, they don’t just need to spot the edge cases; they must reason through them with the surgical nuance of your most senior analyst. And it gets messier. During a deal’s life, your firm’s decision-making flow changes — you manage this with an old-school escalation: “Any deal touching Company A? That’s a ‘Go talk to Bob’ situation.” Your agent won’t bother Bob, because the SOP is utterly silent on the unwritten, high-stakes rule: you don’t make decisions on deals involving companies that are huge limited partners in your firm without getting the fundraising team’s blessing.

Image generated by author using Gemini (nanobanana)

The Art of Spec-Driven Development

Bridging the gap between a rigid SOP and messy reality requires a discipline that software engineers have taken to calling “Spec-Driven Development”: giving the AI agent a clear plan (or asking it to draft a plan and having the human operator validate it), and then having the agent knock off the steps in the plan one by one. Agentic harnesses like Google Antigravity default to spec-driven development, which is one of the reasons they are so successful in software development.
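
In code, the essence of a spec is a plan whose steps are individually machine-verifiable. The sketch below is generic scaffolding under that assumption, not a description of Antigravity or any particular harness; the paydown steps are hypothetical.

```python
# Generic scaffolding for spec-driven development: the plan is a list of
# discrete steps, each paired with a machine-verifiable check. Illustrative only.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    instruction: str                     # what the agent is asked to do
    verify: Callable[[dict], bool]       # deterministic check on the result

def execute_spec(spec: list[Step], agent, state: dict) -> dict:
    for i, step in enumerate(spec, start=1):
        state = agent(step.instruction, state)       # LLM / tool call
        if not step.verify(state):
            raise ValueError(f"step {i} failed verification: {step.instruction}")
    return state

# Example spec for a (hypothetical) paydown workflow:
spec = [
    Step("Identify the facility referenced in the notice",
         verify=lambda s: "facility_id" in s),
    Step("Extract the paydown amount and effective date",
         verify=lambda s: s.get("amount", 0) > 0 and "effective_date" in s),
]
```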

This is not just prompt engineering. It is not about finding the right adjectives or phrases to cajole the model into behaving. It is not just context engineering. It is not about finding the right examples and processes that apply to this situation (although that is important). It is an architectural process of breaking down a nebulous financial workflow into discrete, machine-verifiable steps.

It requires a rare, hybrid skill set. You need the domain expertise to understand why a specific regulation applies, and the engineering capability to translate that regulation into a logical constraint the model can respect. In the software industry, a clear difference is emerging between people who can do this effectively (and who, therefore, have several AI agents working on multiple things for them simultaneously) and others who are stuck using coding assistants as mere thought completers. This is despite the fact that, in software engineering, the engineer is simultaneously a domain expert and knowledgeable about AI. The hybrid skillset is even more difficult to achieve in finance.

You need divergent thinkers

Generalist data science teams often search for the silver bullet, the one prompt or the one foundation model update that will solve the workflow. In our experience building vertical agents, the solution is never one big idea. It is usually fifteen small ideas layered together.

For example, getting an agent to correctly process a complex loan paydown isn’t about writing a better instruction. It involves a mix of domain insight and technical adjustments (a condensed sketch in code follows the list):

  1. Logic Split: “Because paydowns function differently in delayed-draw term loans, we must route this specific request to a sub-agent specialized in DDTLs.”
  2. Data Engineering: “The model struggles when a delayed draw is represented as a negative interest accrual, so we need a post-validator for this situation.”
  3. AI technique: “We have to be able to associate the draw with the correct fund it applies to, and in the absence of a CUSIP, we can still do that because we can train an ML model to use attributes A, B, and C from the notice to pinpoint the relevant deal.”
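
Here is the condensed sketch promised above: routing to a DDTL sub-agent, a post-validator for the negative accrual, and an ML fallback for matching the fund. Every helper is a placeholder stub; only the structure is the point.

```python
# Condensed sketch of the three adjustments above. Every helper is a
# placeholder standing in for real components; only the routing,
# post-validation, and fallback-matching structure is the point.

def ddtl_sub_agent(notice):          # placeholder: DDTL-specialist sub-agent
    return {"interest_accrual": notice.get("interest_accrual", 0.0)}

def generic_paydown_agent(notice):   # placeholder: default paydown agent
    return {"interest_accrual": notice.get("interest_accrual", 0.0)}

def reclassify_negative_accrual(result):   # placeholder post-validator
    result["draw_amount"] = -result.pop("interest_accrual")
    return result

def match_fund(notice):              # placeholder for a trained ML matcher
    return "FUND-UNKNOWN"

def process_paydown(notice: dict) -> dict:
    # 1. Logic split: delayed-draw term loans go to a specialist sub-agent.
    agent = ddtl_sub_agent if notice.get("facility_type") == "DDTL" else generic_paydown_agent
    result = agent(notice)

    # 2. Data engineering: a delayed draw represented as a negative interest
    #    accrual is caught by a post-validator before it poisons the numbers.
    if result.get("interest_accrual", 0) < 0:
        result = reclassify_negative_accrual(result)

    # 3. AI technique: no CUSIP on the notice, so fall back to an ML model
    #    that matches the draw to the right fund from notice attributes.
    if not notice.get("cusip"):
        result["fund_id"] = match_fund(notice)

    return result
```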

An engineer without domain knowledge won’t know to look for the “Delayed Draw” distinction. A banker without engineering skills won’t know that post-validators or machine learning are options.

This explains why even the most talented teams struggle. In most large institutions, the “Subject Matter Experts” (bankers/traders) and the “Engineers” sit in different buildings, or at least different org charts.

Information is passed over the wall via requirements documents. But these are busy people. Coordination problems abound. You cannot document your way through the long tail of edge cases. Instead, the iteration loop needs to be instant.

Vertical AI teams are purpose-built to collapse this silo. We operate in tight, interdisciplinary loops where the person understanding the domain and the person writing the code are often pair-programming (or are the same person). This allows us to implement those “fifteen ideas” in days, whereas a siloed organization might take months to simply identify the root cause of the error.

You also have to hire for, and train for, this hybrid skillset. In people, the ability to connect ideas, explore ambiguities, and recognize nuances is called divergent thinking. Divergent thinkers build a richer, multi-faceted understanding rather than settling on just one “right” answer, especially in open-ended situations. Many engineers fall in love with their solution and go deep on one. What you need in an AI team are people who fall in love with the problem and can explore multiple candidate solutions at once.

The Strategic Case for Vertical AI

If you are a financial firm, why do you need vertical AI solutions? Arming your employees with ChatGPT and Gemini will make them more productive, and so you should do it. However, these general-purpose assistants are good for one-off tasks. For systematic improvement of outcomes, you need standardization of how AI handles the work that your firm does. If you are considering using AI agents to improve outcomes or add scale, you need vertical AI.

I believe in Vertical AI enough that I left a comfortable job to co-found obin.ai. So, this section is me making a case for why you should use a firm like ours. My sales pitch, if you will.

If you are an executive at a finance firm, the decision to deploy AI agents is not a binary choice between “Build” or “Buy.” It is a strategic decision about where to allocate engineering leverage in a field that is still in its infancy.

The reality is that Agentic Engineering is a new discipline. No organization — whether a tech giant or a specialized startup — has a fully finished “Financial Compiler” or a complete library of every possible edge case today. We are all building the map as we walk the terrain.

However, the trajectory of improvement matters. Partnering with a Vertical AI specialist allows a financial institution to turn a fixed cost into a shared, appreciating asset. How?

1. The Economy of Scale in Infrastructure

Building the deterministic evaluation harness and isolated context described in this article is a massive undertaking. The evaluation harness requires thousands of hours of coding rules, regulatory logic, and accounting identities. The isolated context is easier to build if it lives in an environment that is, by nature, separate from your production one, so that agents only read inputs from your production databases and write back only the outputs of successful runs.

Crucially, this is undifferentiated heavy lifting. A linter that checks if $Assets = Liabilities + Equity$ is identical for every asset manager.

By building this platform (as opposed to differentiated use cases) in-house, you are bearing 100% of the cost for a non-proprietary capability. By partnering with a vertical firm, you are tapping into a shared infrastructure that improves with every new client. As the landscape matures, our evaluation engine becomes more robust, and that reliability is passed downstream to you without your team needing to write a single new test.

2. Aggregated Learning on the Long Tail

The “fifteen ideas” required to solve complex edge cases are rarely obvious from day one. They are discovered through friction: by encountering the weird loan clause or the contradictory data point in production.

Any team learns only from its own friction. If it encounters a specific edge case once a year, it may treat it as a one-off anomaly. A Vertical AI firm, however, sees that edge case across dozens of clients. We operate as a compound learning engine. When we solve a complex piece of corporate action logic for Client A, the architectural improvement is immediately available to Client B. You are effectively hiring a system that gets smarter not just from your own data, but from the aggregated complexity of the market.

Crucially, you get this learning even though actual data is kept strictly confidential. The fact that partial PIK payments sometimes occur is not confidential; which companies are currently PIK-ing is. So you get the benefits of a shared ecosystem while preserving confidentiality.

3. Focus in a Volatile Landscape

The technical “Art” of this field is changing weekly. New foundation models, new RAG techniques, and new reasoning frameworks are released at a breakneck pace.

A Vertical AI company has the luxury of singular focus. Our only job is to absorb the complexity of the evolving AI landscape and translate it into domain-specific performance. We act as an abstraction layer, allowing your team to focus on business strategy while we handle the volatility of the underlying tech.

Summary

The transition to Agentic AI is a fundamental shift in how financial work gets done. It promises to automate not just tasks, but entire workflows.

However, the gap between a promising pilot and a production asset is filled with hidden complexity. To cross that chasm, financial leaders need to recognize that they cannot simply “prompt” their way to reliability. It requires domain-specific agent infrastructure and a culture of interdisciplinary engineering.