How many AI Agents do you have?

If the answer is “a lot,” Ramp’s engineers will tell you: that’s a trap.

Ramp is America’s fastest-growing enterprise finance platform, valued at $32 billion with over 50,000 customers and more than $100 billion in annual transaction volume. At The Pragmatic Summit, they sent a four-person team to share their hands-on AI experience: EVP of Engineering Nik Koblov, Director of Applied AI Viral Patel, and two Staff Engineers, Will Koh and Ian Tracey.

They covered five core topics: why their architecture converges from “a bunch of agents” to “one Agent + a thousand skills”; the complete journey of Policy Agent from zero to launch; how to define “correct” for AI; how to build internal AI infrastructure; and Ramp’s internal coding Agent that generated over 50% of PRs that month.

The Cost of One Cup of Coffee: How Ramp AI Agent Tackles Enterprise Finance Costs

Nik Koblov opened with a scenario everyone can understand: a cup of coffee.

In a traditional process, buying a coffee creates about 15 minutes of administrative work: writing notes, categorizing by accounting rules, finding and attaching receipts, normalizing merchant names against the company’s merchant database. Each task seems small on its own, but across an organization they accumulate into a massive hidden cost.

The simplest way to understand what Ramp does is that it automates all of this. From card swipe to notes to categorization to receipts, everything is handled automatically by an Agent. Ramp started down this path about three years ago, initially using AI for single-step processing like normalizing merchant names and auto-writing notes. As model capabilities improved, the results improved with them.

But that’s just the starting point. Almost every role on the Ramp platform does significant manual work: AP specialists processing invoices, finance teams reconciling accounts, procurement teams comparing prices, data teams running reports. Nik mentioned a telling detail: Ramp used to have a Slack channel called help-data, where people would post data requests and some poor soul would write SQL queries. That channel was replaced by AI about a year and a half ago.

From card spending and procurement to close and analysis, Ramp is trying to hand off all these fragmented tasks scattered across teams to an Agent.

Nik said Ramp is experiencing the most exciting paradigm shift in the software industry around AI. This shift demands completely rethinking things, but it also means simplifying the tech stack.

Converge Agents, Expand Skills

Ramp’s architectural judgment is straightforward: converge agents, expand skills.

Last year, Ramp intentionally let teams experiment freely. What they found was a proliferation of implementations across the company, with sync agents and background agents each having multiple versions. The conversational interface count ballooned to five.

Now Ramp has consolidated all conversational interactions into a unified interface called Omnihat. Omni means “everywhere,” and it’s being deployed across every product surface. It works alongside traditional UX, because you don’t always want to “talk” to software—sometimes tables and buttons are enough.

Nik showed an example: typing “help me onboard a new employee” in Omnihat. The Agent automatically parses the employee ID, queries the organizational structure through the HRIS (Human Resources Information System), finds a previously created workflow called “New Employee Onboarding Guide,” and asks if you want to use this process.

Underlying this is a lightweight Agent framework Ramp built in-house, providing orchestration capabilities and tools. Engineers can build new tools very quickly. Recently, a product manager built about 20 tools through vibe coding without any engineering involvement.

A unified conversational entry point, lightweight framework, and tool layer form what Ramp calls “one Agent + a thousand skills.”

For complex processes like employee onboarding with four steps—issuing cards, setting receipt requirements, welcoming on Slack, following up two weeks later—users can describe the desired workflow in natural language on Ramp. The system compiles it into a runnable deterministic workflow, then hands it off to the Agent.
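Such a compiled workflow might look like the following sketch: deterministic step order, no LLM at execution time. The `Workflow` class, step names, and field values are illustrative assumptions, not Ramp's actual schema:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]  # takes and returns the shared workflow state

@dataclass
class Workflow:
    name: str
    steps: list[Step] = field(default_factory=list)

    def execute(self, state: dict) -> dict:
        # Deterministic: steps run in a fixed order, no model in the loop
        for step in self.steps:
            state = step.run(state)
        return state

# Hypothetical onboarding workflow "compiled" from a natural-language description
onboarding = Workflow(
    name="New Employee Onboarding Guide",
    steps=[
        Step("issue_card", lambda s: {**s, "card_issued": True}),
        Step("set_receipt_policy", lambda s: {**s, "receipt_policy": "required"}),
        Step("slack_welcome", lambda s: {**s, "welcomed": True}),
        Step("schedule_followup", lambda s: {**s, "followup_days": 14}),
    ],
)

result = onboarding.execute({"employee_id": "E-1024"})
```

The compilation step (natural language in, a `Workflow` out) is where the LLM sits; execution itself stays auditable and repeatable.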

Policy Agent: Start with the Minimum Viable Product

Ramp’s Policy Agent is one of their most popular Agent products.

Viral Patel took over, first showing a real scenario: the finance team reviews hundreds to thousands of receipts daily. He pulled out a receipt and said that if he had to judge by eye whether the transaction should be approved or rejected, he’d most likely get it wrong.

But Policy Agent reasoned through the receipt: identifying 8 guests (Viral said he could barely make out the number himself), confirming it was below the $80 per-person internal limit, judging this was a team welcome dinner, and recommending approval. Another transaction at OpenAI, where an employee was testing ChatGPT features, was judged a legitimate business expense and approved. A $3 bakery purchase was rejected because it wasn’t an overtime purchase and wasn’t on a weekend.
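The three decisions above share a common shape: a verdict plus step-by-step reasoning. As an illustrative schema (an assumption for exposition, not Ramp's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class PolicyVerdict:
    """Illustrative shape of a Policy Agent decision (not Ramp's real schema)."""
    transaction_id: str
    decision: str            # "approve" | "reject" | "uncertain"
    reasoning: list[str]     # step-by-step checks, e.g. guest count vs. limit
    policy_refs: list[str] = field(default_factory=list)

# The team-dinner case from the talk, encoded in this shape
dinner = PolicyVerdict(
    transaction_id="tx-dinner",
    decision="approve",
    reasoning=[
        "8 guests identified on the receipt",
        "per-person cost under the $80 internal limit",
        "purpose judged as a team welcome dinner",
    ],
    policy_refs=["meals.team_events"],
)
```

Keeping the reasoning as explicit, separate steps is what later makes the decision auditable.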

The product philosophy behind these cases came from an opportunity: a Fortune 500 client approached Ramp with a long list of rules and said, “Help us approve and reject according to these.”

Policy Agent doesn’t just read rules—it also coordinates with other Agents like accounting and bookkeeping to actually land decisions in the system.

Ramp could have continued down the old path, hardcoding these rules as deterministic logic, adding them to the product one by one. But they chose a different direction. Viral cited Andrej Karpathy: English is the hottest new programming language. They decided to turn the expense policy document itself into the rules, letting the Agent directly understand natural language policy documents and execute based on them.
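A minimal sketch of the "policy document as program" idea: the natural-language policy is injected verbatim as the ruleset, instead of being hardcoded as deterministic logic. The sample policy text and function names are illustrative:

```python
POLICY_DOC = """\
Meals: up to $80 per person for team events.
Software: individual SaaS purchases require manager pre-approval.
"""  # in practice, the customer's full expense policy document

def build_policy_prompt(policy: str, transaction: dict) -> str:
    """Assemble the review prompt: the natural-language policy IS the rules."""
    return (
        "You are an expense policy reviewer. Apply ONLY the policy below.\n\n"
        f"## Policy\n{policy}\n"
        f"## Transaction\nMerchant: {transaction['merchant']}\n"
        f"Amount: ${transaction['amount']:.2f}\n"
        f"Memo: {transaction['memo']}\n\n"
        "Respond with approve / reject / uncertain and your reasoning."
    )

prompt = build_policy_prompt(
    POLICY_DOC,
    {"merchant": "Cafe Luna", "amount": 412.50, "memo": "team welcome dinner, 8 guests"},
)
```

Updating behavior then means editing `POLICY_DOC`, not shipping code, which is the loop the finance-side section below relies on.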

Viral and Will Koh both emphasized a core lesson:

AI products can’t be done in one step. You have to start with something simple.

They initially wanted to go big: “Let’s automate the entire finance review.” But when actually starting, they chose the smallest entry point: coffee expense reimbursement. These single low-value transactions carry low risk, and the finance team doesn’t care much about them.

Ramp pushed the first step to its minimum: solve one category of high-frequency, low-risk, but annoying approval scenarios.

Context Engineering: What Works Better Than Switching Models

Ramp did extensive dogfooding internally, training and testing Policy Agent with their own transaction data.

Policy Agent’s biggest source of errors isn’t the model itself, but insufficient context provided to the LLM.

They could have tried to figure out all the needed context upfront, but reality didn’t allow it. For example, they discovered that an employee’s level and title significantly impact expense approvals: C-level executives may have higher spending limits and can fly business class on trips. Without this information in the context, the Agent makes wrong judgments.

So they started extracting more information from receipts, pulling employee profiles from HRIS systems. Each time they discovered a new context dimension, they added it and observed the effect.
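That incremental enrichment loop can be sketched as a function that bolts on each newly discovered context dimension. The field names and the HRIS lookup shape are assumptions for illustration:

```python
def enrich_context(transaction: dict, hris_profiles: dict) -> dict:
    """Attach each context dimension as it was discovered to matter.
    Employee level/title are the examples from the talk; more dimensions
    get added here over time, each one validated against the eval set."""
    employee = hris_profiles.get(transaction["employee_id"], {})
    return {
        **transaction,
        "employee_level": employee.get("level", "unknown"),
        "employee_title": employee.get("title", "unknown"),
    }

# Hypothetical HRIS snapshot
hris = {"E-7": {"level": "C-level", "title": "CFO"}}
ctx = enrich_context({"employee_id": "E-7", "amount": 1800.0, "category": "airfare"}, hris)
```

The key property is the defaulting: a missing profile degrades to `"unknown"` rather than crashing, so the Agent can fall back to an "uncertain" verdict.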

Will described three evolution stages of Policy Agent’s architecture in detail.

Stage One: Simple Pipeline. An expense comes in, relevant context is retrieved, a series of well-defined LLM calls makes the “is it compliant” judgment, and a result is output.

Stage Two: Conditional Branching. They found that expense types differ significantly: travel, dining, and entertainment each have different judgment logic. So they added expense classification, with prompts and context retrieval conditioned on type, while giving the LLM tools to autonomously decide “I need to check flight information” or “I need to check this employee’s level.”

Stage Three: Full Agentification. The Agent can now read all data on the Ramp platform through a shared internal toolkit. And it doesn’t just read anymore; it writes: approval decisions, reasoning traces, directly approving expenses on behalf of users. It works in a loop, calling tools multiple times, gathering information, and making judgments.

This slide is essentially Ramp’s methodology: start with a simple pipeline, then gradually add conditional branching, tools, and autonomous loops.
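The stage-one starting point can be sketched as a fixed function chain; the `llm` and `retrieve_context` callables here are stand-ins for a real model call and real retrieval, and all names are illustrative:

```python
from typing import Callable

def policy_pipeline(expense: dict, retrieve_context: Callable, llm: Callable) -> dict:
    """Stage one: a fixed sequence of well-defined calls, no branching, no loop."""
    context = retrieve_context(expense)
    verdict = llm(f"Is this expense compliant? {expense} Context: {context}")
    return {"expense_id": expense["id"], "verdict": verdict}

# Stubbed dependencies so the skeleton runs end to end
stub_retrieve = lambda e: {"policy": "meals under $80/person"}
stub_llm = lambda prompt: "approve"

out = policy_pipeline({"id": "tx-1", "amount": 12.0}, stub_retrieve, stub_llm)
```

Stages two and three then grow out of this skeleton: branching on an expense classifier, and eventually handing the loop itself to the Agent.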

Capability went up, autonomy went up, but traceability and explainability went down. A small black box became a big black box. You can look at the reasoning tokens, but essentially you can’t control what it does.

Since the black box is growing, auditability becomes especially important. Will proposed a principle: even if you know how the system works internally, assume you can only see inputs and outputs, then verify whether outputs are correct.

Ground Truth: User Actions Are Not the Standard Answer

Initially, they thought user behavior was the standard answer: if the user approved, the Agent should approve; if the user rejected, the Agent should reject. But in reality, many users don’t know the company’s expense policy, some trust subordinates too much to bother checking details, some click through approvals on weekends without looking. The finance team would later say “this one shouldn’t have been approved.”

So Ramp had to establish their own definition of correctness. Their approach was to build a cross-functional annotation process, using human-annotated data to define ground truth.

Ramp removed “user actions” from the answer set, instead using cross-functional annotation to define their own ground truth.

This brought two benefits. First, they had a continuously testable benchmark dataset, confirming these annotations were correct. Second, everyone aligned their understanding. If the Agent was wrong, everyone knew where; if the Agent was missing context, everyone knew what was missing. Communication costs decreased, and the team could quickly focus on real priorities.
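A sketch of what such a benchmark might look like, with the human-agreed label as the only ground truth. The record fields and scoring function are assumptions, not Ramp's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    transaction_id: str
    label: str      # "approve" | "reject" | "uncertain", agreed cross-functionally
    rationale: str  # why, so disagreements surface what context is missing

def accuracy(benchmark: list[Annotation], agent_decisions: dict) -> float:
    """Score the agent against human-defined ground truth, not user actions."""
    correct = sum(
        1 for a in benchmark if agent_decisions.get(a.transaction_id) == a.label
    )
    return correct / len(benchmark)

benchmark = [
    Annotation("tx-1", "approve", "team dinner, under per-person limit"),
    Annotation("tx-2", "reject", "personal purchase, not a business expense"),
]
score = accuracy(benchmark, {"tx-1": "approve", "tx-2": "approve"})
```

Storing the rationale alongside the label is what keeps the second benefit: when the Agent is wrong, everyone can see which reasoning step it missed.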

But gathering people weekly to annotate 100 data points is expensive. Will said everyone had their own things to do, sometimes showing up to meetings without finishing their “homework.”

So they wanted to make the annotation process as simple as possible. They looked at third-party tools first, finding some too specialized and others too generic—trying different tools alone took several weeks. So they decided to build their own.

Using Claude Code and Streamlit, they built essentially the entire annotation tool in one shot. The biggest benefit is low maintenance cost and low risk: it lives in an independent corner of the codebase, and if it breaks, they can fix it immediately. Deployment is nearly instantaneous, and non-engineers can modify it directly through vibe coding.

With a benchmark dataset in place, iteration speed increased dramatically. Need employee level information? Add it, run through the dataset, see if it correctly captures the new scenario.

Will said everyone knows the concept of eval now, but he wanted to emphasize “do it early.” Don’t pursue perfection, don’t need 1,000 data points upfront. They started with 5, made sure those 5 were absolutely correct, then kept accumulating.

A few practical points: make sure anyone can easily run the eval command, make sure results are clear at a glance (good or bad, immediately obvious), integrate it into CI so it runs automatically on every code merge.
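Those three points might combine into an eval runner like this sketch; the threshold, stub agent, and output format are illustrative, and in CI a below-threshold score would fail the build:

```python
def run_eval(benchmark: list[tuple[str, str]], agent) -> float:
    """One command, glanceable PASS/FAIL output per case, returns the score.
    In CI, the caller would exit nonzero when the score drops below a bar."""
    correct = 0
    for case_input, expected in benchmark:
        got = agent(case_input)
        mark = "PASS" if got == expected else "FAIL"
        correct += got == expected
        print(f"[{mark}] {case_input!r}: expected {expected}, got {got}")
    score = correct / len(benchmark)
    print(f"score: {score:.0%}")
    return score

THRESHOLD = 0.9  # illustrative regression bar

# Stand-in for the real Policy Agent: always approves
score = run_eval(
    [("coffee $3", "approve"), ("casino $500", "reject")],
    lambda x: "approve",
)
```

Wiring `run_eval` into the merge pipeline is what turns the benchmark from a weekly meeting artifact into an automatic regression gate.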

Giving LLMs more context and more tools often brings unexpected side effects. Context rot (where excessive context degrades model performance), vague or contradictory tool descriptions—these issues only surface through eval.

Online eval also has value. They have an “uncertain” decision type, meaning the Agent believes it doesn’t have enough information to make a judgment. Monitoring changes in this ratio is a simple but effective system health indicator.
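That health indicator is just a ratio plus a drift check; the baseline and tolerance values here are made up for illustration:

```python
def uncertain_ratio(decisions: list[str]) -> float:
    """Share of 'uncertain' verdicts in a recent window of agent decisions."""
    return decisions.count("uncertain") / len(decisions)

def health_alert(ratio: float, baseline: float = 0.05, tolerance: float = 2.0) -> bool:
    """Flag when the uncertain rate drifts well above its historical baseline:
    a cheap online signal that context or tooling has regressed."""
    return ratio > baseline * tolerance

recent = ["approve", "approve", "uncertain", "reject", "approve"]
ratio = uncertain_ratio(recent)
```

A rising ratio does not say *what* broke, only that something did, which is exactly what makes it a good always-on monitor next to the offline eval.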

With a solid eval system, you can switch models with confidence. Whenever a new model is released, running the eval tells you whether to switch. New models may make some things better and others worse; without eval, you’re afraid to touch anything.

The Financial Person’s Claude.md

After Policy Agent launched, Will shared a discovery: engineers using Claude Code modify the Claude.md file to control Agent behavior. Finance people have the same need; their “Claude.md” is simply the company’s expense policy document.

If a Policy Agent decision is wrong, Ramp’s advice is: “Update your expense policy document.” This was initially scary for finance people. Expense policy is a formal document, not casually editable, requiring approval processes for changes. But once they realized modifications immediately showed effects, their attitude completely reversed. This feedback loop is very similar to engineers tuning an Agent: write strategy into documents, modify, immediately see system behavior change.

Trust building is phased. Ramp started rolling out to Fortune 500 enterprise clients first, because these have the largest expense volumes, deepest approval pain, and can most intuitively feel the product value. Initially offering only “suggestions,” no automatic actions.

Then clients came to them: “Under $20 transactions, your judgments are basically all correct. I don’t want to look anymore—let me auto-approve.”

So Ramp gave them an “autonomy slider,” letting clients decide how far to automate. Trust isn’t designed by product managers—it’s built by users through actual use.
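A sketch of how such a slider might gate the Agent's autonomy. The routing labels, threshold semantics, and escalation rule are invented for illustration:

```python
def route_decision(agent_verdict: str, amount: float, autonomy_limit: float) -> str:
    """Client-controlled autonomy: at or below the slider value, the Agent acts;
    above it, the Agent only suggests; uncertainty always goes to a human."""
    if agent_verdict == "uncertain":
        return "escalate_to_human"
    if amount <= autonomy_limit:
        return f"auto_{agent_verdict}"      # act on the user's behalf
    return f"suggest_{agent_verdict}"       # surface a recommendation only

# The client from the talk: "under $20, just auto-approve"
routed = route_decision("approve", 12.0, autonomy_limit=20.0)
```

The slider makes the trust boundary a client setting rather than a product decision, which matches the phased rollout: suggestions first, autonomy where earned.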

Applied AI Service: Making Model Switching a One-Line Config

Ian Tracey took over the infrastructure portion. He said one core question Ramp is thinking about is: how do we get leverage for Ramp itself? Not just for customers, but for internal engineers and cross-functional teams.

They built a service called Applied AI Service. From a high level, it looks like an LLM gateway, similar to unified interface tools like LiteLLM, but with three important extensions.

First, unified structured output and consistent API/SDK across model providers. Different model providers’ APIs change quickly, and Ramp doesn’t want downstream product teams to worry about this. If you want to switch from GPT-5.3 to Opus, or try Gemini 3 Pro, change one config line and you can immediately run semantic similarity tests or sandbox experiments.
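A sketch of the one-config-line switch: downstream code addresses a logical route, and the gateway maps it to a provider and model. The route names, the `provider/model` string format, and the stub clients are assumptions, not Ramp's actual config:

```python
# Hypothetical gateway config: downstream teams reference logical names only.
MODEL_ROUTES = {
    "policy_agent": "anthropic/claude-opus",  # switch providers by editing this line
    "receipt_ocr": "openai/gpt",
}

def complete(route: str, prompt: str, providers: dict) -> str:
    """Resolve a logical route to a provider client and call it."""
    provider, model = MODEL_ROUTES[route].split("/", 1)
    return providers[provider](model, prompt)

# Stub clients stand in for the real provider SDKs
providers = {
    "anthropic": lambda model, prompt: f"[{model}] ok",
    "openai": lambda model, prompt: f"[{model}] ok",
}
reply = complete("policy_agent", "Is this expense compliant?", providers)
```

Because call sites never name a provider or model, one edit to `MODEL_ROUTES` is enough to move every downstream team to a new model.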

Second, cost governance. Applied AI Service consolidates model switching, structured output, batch processing, and cost governance—all the underlying headaches. For scenarios like batch document analysis or eval, how to handle rate limits, choosing online or offline tasks—none of this needs to concern downstream teams. This lets them identify Pareto curves for performance and cost: which model is optimal at which cost-performance point? Which teams’ usage is unsustainable long-term?

Ian mentioned a detail: they joke internally that customers might be running the latest frontier models before Ramp itself even realizes it. Because when a new model drops, one config change inside Ramp affects all downstream SDKs. Teams don’t need to learn new SDKs, don’t need to change a dozen call sites—one change lets them enjoy the latest model’s capabilities.

Third, the tool directory. It contains tools like “get policy snippet,” “check PDM rates,” “check recent transactions”—built by product teams with deep understanding of data and scenario nuances. This directory currently has hundreds of tools, expected to grow to thousands.

Ramp believes an Agent’s ceiling depends largely on its tool chest and context, not simply on which model you switch to.

One clever thing about this directory: it works in both internal repos and the core product. If you have a new idea for an “expense Agent,” just pick tools from the directory, assemble them, and quickly prototype in a vibe coding environment—no need to reinvent the wheel.
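The pick-and-assemble flow might look like this sketch, with a shared directory of tools. The tool names echo the talk, but the stub implementations and `assemble_agent` helper are invented:

```python
TOOL_DIRECTORY = {
    # Hundreds of these in practice, built by product teams who know the data
    "get_policy_snippet": lambda topic: f"policy text about {topic}",
    "check_recent_transactions": lambda employee_id: [{"id": "tx-9", "amount": 3.0}],
}

def assemble_agent(tool_names: list[str]) -> dict:
    """Prototype a new agent by picking tools from the shared directory,
    instead of reimplementing data access for each new idea."""
    missing = [n for n in tool_names if n not in TOOL_DIRECTORY]
    if missing:
        raise KeyError(f"unknown tools: {missing}")
    return {n: TOOL_DIRECTORY[n] for n in tool_names}

# A hypothetical "expense Agent" prototype: just a selection of existing tools
expense_agent = assemble_agent(["get_policy_snippet", "check_recent_transactions"])
```

The directory is the reuse boundary: the same entries serve the core product, internal repos, and quick vibe-coded prototypes.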

Ramp Inspect: The Internal Coding Agent Generating 50%+ of PRs

Ian then discussed a problem Ramp faces internally that’s similar to what their customers face: engineers’ daily work is also highly fragmented. Even using Claude Code or Codex, a lot of work is scattered across Datadog logs, production databases, alert systems, Jira, Figma, Slack messages, Notion documents, plus each product team’s unique knowledge and processes.

At the end of 2025, they decided to solve this, building an internal coding Agent called Inspect.

That month, Ramp Inspect generated over 50% of PRs merged to production.

Ian showed a usage dashboard. The engineering team’s usage is far ahead, but product, design, risk, legal, enterprise finance, even marketing and customer support teams are using it. Their tasks include simple copy edits, logic fixes, incident responses, and bug fixes.

Inspect isn’t just “can write code”—it’s essentially a parallelizable, context-aware, sandboxed background coding system.

Technically, each Inspect session launches quickly in a Modal sandbox. The sandbox contains a complete development environment, identical to an engineer’s local setup. It maintains task lists to keep direction, creates GitHub branches, and integrates with all internal context: Datadog, read-only database replicas, and various context documents. It also embeds a VS Code editor and a remote desktop environment, can run Chrome DevTools, and can do full-stack development. It has access to Ramp’s tool directory, and if CI fails, it fixes the failure itself before notifying you that the PR is ready.

There are three launch methods: kanban interface, API, or Slack thread. When launching from Slack, it reads the full Slack conversation context—no need to re-describe the problem.

This means when collaborating with designers or PMs, they can observe and guide the Agent’s behavior in the same Inspect session. They can click a link to see “this result doesn’t match my expectation,” and give feedback. This becomes an entry point for cross-functional collaboration, and helps non-engineers improve their prompting skills.

The AI Agent Era: High-Value Work Shifts from Coding to Judgment

Ian turned the conversation to engineering culture at the end. He proposed a thought experiment: imagine two types of teams.

Team A cares about impact, can handle ambiguous problems, understands products, business, and data, is willing to adopt new tools, finds creative solutions, and is obsessed with user experience.

Team B debates which library to use, adds processes when chaos emerges, constantly complains about not having enough people, gets lost in details rather than focusing on user experience, starts building things before understanding the problem (“let’s just vibe code this, bro, don’t worry”), or obsesses over subjective code style preferences.

Ian cited a Harvard study. Analyzing data from 6.2 million workers across 285,000 US companies (2015-2025), it found that after AI tools became widespread, entry-level hiring dropped about 7.7% over six quarters, while senior positions remained largely unaffected. This decline was primarily driven by slower hiring, not layoffs.

But Ian felt this study was overly simplified into a “junior vs. senior” years-of-experience narrative. In his view, the real dividing line isn’t visible on a resume—the gap between Team A and Team B is what matters.

Coding was never the hardest part of many jobs. Senior engineers get paid more mainly for their judgment: contextual understanding, ability to foresee risk, the “scar tissue” accumulated from past mistakes. When you ask Opus 4.6 to do something, an experienced person can tell when its approach won’t work, or is a bad idea in the first place.

Ian’s core point isn’t “AI replaces who,” but that the bar for high-value work is shifting from coding speed to judgment. With AI coding agents, you might just be wrong faster, and building a bigger mess faster.

Much of the media narrative misses this. They focus on “AI can write code now,” but don’t see that figuring out what to build, persuading skeptical stakeholders, making design decisions with incomplete information, and maintaining momentum through the long middle stretches of projects—these abilities only become more important.

The same goes for SaaS industry discussions around vibe coding. Yes, quickly prototyping is easy. But really crossing that “middle ground” from prototype to product-market fit requires truly excellent engineers, and this gets far too little discussion.

Ian closed with an optimistic judgment. He acknowledged there are many pessimistic narratives around AI, but views it as an exciting era for building. He quoted something from inside Ramp: “Jobs not finished.” Software is always in a state of “unfinished.” When the freed-up capacity is no longer consumed by low-level chores, four things happen: companies pursue opportunities they couldn’t previously afford, enter adjacent markets to stitch together more value for customers, rebuild old systems that were too costly to touch before, and raise the bar for “good enough.”

AI isn’t just about saving time—it pushes the boundary of “worth doing” outward.

Signals Worth Watching

Ramp left the industry with a complete reference through this talk. Their path emphasizes building correctness, context, auditability, and trust one brick at a time, rather than simply betting that models keep getting stronger and automatically solve everything.

The truly scarce capability is knowing what to use models for, and knowing whether they’re doing it right.


FAQ

Q: Why did Ramp choose “one Agent + a thousand skills” instead of building multiple specialized agents?

A: Last year they let teams experiment freely, resulting in multiple agent implementations and conversational interfaces creating serious maintenance burden. Converging to one Agent while expanding skills reduces complexity while maintaining scalability.

Q: What’s Policy Agent’s biggest challenge?

A: Not model capability, but providing sufficient context to the model. Context dimensions like employee level, receipt details, and merchant information need to be discovered and added incrementally through eval—more effective than simply switching models.

Q: Is Ramp Inspect friendly to non-engineers?

A: Very. Product, design, legal, marketing, and other teams are all using it. It supports launching from Slack, reads full conversation context, and non-engineers can directly participate in and guide the Agent’s behavior.


By the way, I regularly share AI tools and insights on my WeChat official account 「梦兽编程交个朋友」. Follow it for weekly updates on AI programming and tool reviews.