On June 3, Obin AI and Motive Partners hosted an executive AI panel during NY Tech Week. We had 40+ enterprise AI leaders in the room for a conversation around “The Hard-Won Lessons from Deploying AI Agents in Regulated Environments.”

Our panel, moderated by Lak Lakshmanan, CTO of Obin AI, brought together three operators who have lived this firsthand. Andrew "AJ" Lang, former Global CTO at JPMorgan Chase and now Board Director at Deutsche Bank, brings the operator's lens from one of the largest US banks and the governance lens from the board seat at a European Tier 1. Virender Bedi, Managing Director of Client & Product Solutions Technology at Apollo Global Management, is actively shipping production AI across capital formation, investing, and distribution at one of the world's largest asset managers. Toby Kovacs, Financial Services Director at Google Cloud, brings the platform provider's view from the seam between model providers and financial services.

Every large financial institution has an AI mandate. Operationalizing that mandate inside a regulated business is much harder than giving employees access to ChatGPT and Claude.

What follows are the lessons that stuck with us, drawn both from the discussion and from what we see playing out across the institutions we work with.

Production is a data and architecture problem more than a model problem

We can all agree that POCs work, but the hard part is achieving reliability and accuracy at scale, customizing to firm-specific workflows, and ensuring the traceability that regulators require. A business user with the right tooling can build a working artifact in about a day. Getting that same artifact to run reliably in production, with the right entitlements, security, scale, and audit trail, can take months of engineering work.

To understand the reason, we have to look beneath the model layer. Drawing from his experience at JPMorgan Chase, AJ Lang described how a seemingly simple conversational front end on a banking app pulls a thread that exposes mainframe batch processing, data currency problems, governance issues, and entitlements. The model is a small piece. Pipeline architecture, lineage cataloging, business metadata, and standardized data products are the rest of the work. We see the same dynamic at every institution we work with. The teams that treat the platform layer as a first-class investment are the ones that ship.

Buy beta. Build alpha.

This was the single most quotable line of the night and the clearest articulation we've heard of the doctrine emerging across Tier 1 institutions. Virender Bedi stated, "Our philosophy has been to buy beta and build alpha."

The build side is the judgment layer: portfolio management, security allocation, money movement, trading instructions, and anywhere else proprietary firm IP creates the differentiation. The buy side is the commoditizing infrastructure: PDF, Excel, and Outlook plumbing, data movement, document extraction, and the Anti-Money Laundering (AML), sanctions-screening, and transaction-monitoring solutions where industry-shared models improve as the network grows. There's no competitive advantage in building your own AML model, and there's a real cost to cutting yourself off from a shared model that gets better as more institutions feed it.

The debate went further. Even when you buy, should you keep your IP abstracted away from the vendor? Context, memory, and prompts should stay on your side of the line, so that switching cost stays close to zero. The institutions getting this right run multiple model providers behind a common gateway, with business logic deliberately decoupled. The ones getting it wrong are locking themselves into vendor stacks they'll fight to leave in eighteen months.
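To make the gateway idea concrete, here is a minimal sketch in Python. It is our own illustration, not something presented on the panel: the `ModelGateway` class, the `PROMPTS` table, and the `vendor_a` / `vendor_b` stand-ins are all hypothetical names, and a real deployment would wrap actual vendor SDK calls and add entitlements, logging, and audit. The point it shows is that prompts and routing live on the institution's side, so changing providers is a configuration change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical provider interface: any callable mapping a prompt to a completion.
Provider = Callable[[str], str]


@dataclass
class ModelGateway:
    """Routes each task to an ordered list of providers, falling back on failure."""
    providers: Dict[str, Provider] = field(default_factory=dict)
    routes: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, name: str, provider: Provider) -> None:
        self.providers[name] = provider

    def set_route(self, task: str, provider_names: List[str]) -> None:
        self.routes[task] = list(provider_names)

    def complete(self, task: str, prompt: str) -> str:
        errors = []
        for name in self.routes[task]:
            try:
                return self.providers[name](prompt)
            except Exception as exc:  # try the next provider in the route
                errors.append(f"{name}: {exc}")
        raise RuntimeError(f"all providers failed for {task!r}: {errors}")


# Firm-owned prompt templates stay on the institution's side, not the vendor's.
PROMPTS = {"summarize": "Summarize for a credit committee:\n{document}"}


def summarize(gateway: ModelGateway, document: str) -> str:
    prompt = PROMPTS["summarize"].format(document=document)
    return gateway.complete("summarize", prompt)


# Stand-in providers for illustration only; real ones would wrap vendor SDK calls.
def vendor_a(prompt: str) -> str:
    raise TimeoutError("vendor A unavailable")


def vendor_b(prompt: str) -> str:
    return f"[vendor B] {len(prompt)} prompt characters received"


gateway = ModelGateway()
gateway.register("vendor_a", vendor_a)
gateway.register("vendor_b", vendor_b)
gateway.set_route("summarize", ["vendor_a", "vendor_b"])
result = summarize(gateway, "Q2 covenant report")  # falls back to vendor B
```

In this sketch, swapping or reordering vendors only touches the `set_route` call; `summarize` and the prompt template never change.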

The next hard skill is operational discipline

AJ drew the parallel every CIO in the room recognized: in the early years of the cloud era, companies blew their cloud budgets because the cost discipline hadn't been invented yet. Token ops will follow the same path. The half-million-dollar employee with a half-million-dollar annual token bill is a problem coming for every institution that doesn't get ahead of it.
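As a rough illustration of what token-level cost discipline could look like, the sketch below tracks spend per owner against a monthly budget. Everything in it is an assumption made for illustration: the `TokenLedger` name, the single flat price per million tokens, and the numbers in the usage lines are placeholders, not real vendor pricing or anything cited on the panel.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TokenLedger:
    """Tracks token spend per owner (an employee or an agent) against a budget."""
    price_per_million_usd: float   # placeholder flat rate, not real pricing
    monthly_budget_usd: float      # hypothetical per-owner budget
    spend_usd: Dict[str, float] = field(
        default_factory=lambda: defaultdict(float)
    )

    def record(self, owner: str, tokens: int) -> float:
        """Add one usage event and return its cost in USD."""
        cost = tokens / 1_000_000 * self.price_per_million_usd
        self.spend_usd[owner] += cost
        return cost

    def over_budget(self) -> List[str]:
        """Return owners whose spend exceeds the monthly budget, sorted by name."""
        return sorted(
            owner
            for owner, spent in self.spend_usd.items()
            if spent > self.monthly_budget_usd
        )


# Illustrative numbers only.
ledger = TokenLedger(price_per_million_usd=10.0, monthly_budget_usd=100.0)
ledger.record("alice", 5_000_000)
ledger.record("alice", 6_000_000)
ledger.record("bob", 1_000_000)
flagged = ledger.over_budget()
```

A production version would likely price input and output tokens separately and per model, but the shape is the same: meter at the gateway, attribute to an owner, and alert before the bill arrives.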

Evaluation is an old discipline applied in a new context. First, you must build ground truth, then continuously compare models, because the one you validated last quarter may have been deprecated. Next, tune accuracy thresholds to the use case. 85% is fine for translation, but 98% is the floor for credit analysis. Keep a human in the loop where regulators care, because continuous validation can be agent-driven but final accountability cannot.
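The loop described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: exact-match scoring stands in for whatever metric a real use case needs, and the function names, the `THRESHOLDS` table layout, and the toy models are hypothetical. Only the 85% and 98% floors come from the discussion.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]
GroundTruth = List[Tuple[str, str]]  # (input, expected output) pairs

# Accuracy floors per use case, using the figures quoted in the discussion.
THRESHOLDS = {"translation": 0.85, "credit_analysis": 0.98}


def accuracy(model: Model, ground_truth: GroundTruth) -> float:
    """Share of ground-truth items the model answers exactly right.

    Exact match is an assumption made to keep the sketch short.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    correct = sum(1 for question, expected in ground_truth
                  if model(question) == expected)
    return correct / len(ground_truth)


def evaluate(models: Dict[str, Model], ground_truth: GroundTruth,
             use_case: str) -> Dict[str, Dict[str, object]]:
    """Score every candidate model against the floor for one use case."""
    floor = THRESHOLDS[use_case]
    report = {}
    for name, model in models.items():
        score = accuracy(model, ground_truth)
        report[name] = {"accuracy": score, "passes": score >= floor}
    return report


# Toy data: 20 items, and a toy model that gets the last 2 wrong.
truth = [(str(i), str(i * 2)) for i in range(20)]


def toy_model(question: str) -> str:
    n = int(question)
    return str(n * 2) if n < 18 else "unknown"


translation_report = evaluate({"toy": toy_model}, truth, "translation")
credit_report = evaluate({"toy": toy_model}, truth, "credit_analysis")
```

Here the same model scores 90%, which clears the translation floor and fails the credit-analysis one. Re-running `evaluate` on a schedule, and whenever a provider deprecates a model, is the continuous-comparison step; the final sign-off on regulated use cases still belongs to a person.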

The most forward-looking institutions are building the gateway, routing logic, eval framework, and human-in-the-loop posture before they need them.

The orchestra of agents is already being built

When prompted about the future of the industry, the discussion converged on one main idea. Virender described a near-future in which institutions run hundreds of thousands of agents. A salesperson has a sales agent, and that sales agent sits on top of hundreds of capability-specific sub-agents. Agents get employee IDs. Agents get evaluated. Agents get fired.

AJ closed it with the metaphor that anchored the panel: humans become conductors. Agents sit in the violin section, the viola section, and the cello section. Each agent stays in its structure and its role. The violinist doesn't get up and start playing bass. But the human handles composition.

The orchestra is now being built within major institutions. It’s exciting, but also challenging, and no one can do it alone. Buy beta. Build alpha. You must think beyond the model, and when you start asking “which agents do my people manage?” you start to build real alpha.
