For twenty years, software pricing had one unit: the person with a login. AI broke that unit. The work is now done partly by machines whose costs arrive by the token, whose customers consume by the agent, and whose margins behave nothing like SaaS. Three in four software companies changed their pricing in the past year. Most are improvising.
This paper is about doing it deliberately.
It answers three questions. First, how the AI cost crunch and new usage expectations are reshaping software pricing, and what product and pricing leaders need to do now. Second, what intelligence actually costs in mid-2026 — frontier APIs, open-weight models, on-prem GPUs, and the rest of the stack — and when to rent versus own. Third, how to build an operating model and a tech stack for finance-grade usage data, clean revenue recognition, and margin protection without a rip-and-replace billing project. Every figure is sourced; the full list is at the end.
Part I — The market shift: from seats to work
The pricing models are moving under our feet
The survey data tells one consistent story. In Kyle Poyar's 2025 State of B2B Monetization (240+ companies, surveyed April–May 2025), pure seat-based pricing fell from 21% to 15% of companies in a single year, flat-fee subscriptions fell from 29% to 22%, and hybrid pricing — a platform fee plus a usage or outcome meter — jumped from 27% to 41%. Metronome's January 2025 survey found 85% of software companies have adopted some form of usage-based pricing, nearly half of them within the last two years. ICONIQ's State of AI survey of ~300 software executives found 37% planning a pricing change within the year, with the top driver being customer demand for consumption- and outcome-based pricing (46%), followed by demand for predictability (40%) and competitive pressure (39%).
Note what the top two drivers say: customers want consumption pricing and they want predictability. The winning structures are hybrids precisely because they hold both: a committed base the CFO can budget, a meter that scales with value, and safeguards on top (ICONIQ: 49% use annual commitments, 29% tiered overages).
Bain's October 2025 analysis of 30+ established SaaS vendors adding generative AI found the same hybrid convergence from the other direction: roughly 35% simply raised per-seat prices and bundled AI in; about 65% adopted hybrid structures layering an AI meter on top of seats; none still monetized AI purely as a separate add-on — and none had fully abandoned seats either. The seat is not dead. It is being demoted from the unit of value to a unit of access.
Outcome pricing: the frontier, not the norm
The most discussed change is charging for results rather than usage. The flagship examples are real: Intercom's Fin charges $0.99 per resolution (now billed per "outcome" — no charge if the AI escalates to a human). Zendesk followed in March 2025 at roughly $1.50 per automated resolution, with a resolution confirming after 72 hours of ticket inactivity. Sierra built its business on outcome pricing from day one — paid on resolved conversations, saved cancellations, completed upsells. Bessemer's February 2026 playbook frames the era in one line: client-server monetized licenses, SaaS monetized access, AI will monetize outcomes.
But keep the base rate in view: in Poyar's survey only 5% of companies say outcome-based is their primary model today; 25% expect it to be by 2028. Outcome pricing requires an outcome you can define, attribute, and defend in a renewal negotiation. "Ticket resolved with no human touch and no reopen within 72 hours" clears that bar. Most AI value — drafting, summarizing, coding assistance — does not, yet.
Watch Salesforce as the bellwether of how hard this is: Agentforce launched at $2 per conversation (October 2024), added Flex Credits at $0.10 per action (May 2025), then added per-user "digital labor" licenses from about $125/user/month — three pricing models in roughly eighteen months. That is not indecision so much as live experimentation at enterprise scale, and it is what the whole market is doing.
The cost crunch is the forcing function
Why is everyone repricing? Because AI features carry real marginal cost, and classic SaaS pricing assumed there was none.
- AI-native gross margins run 25–60%, versus 70–90% for classic SaaS. Bessemer's 2025 benchmarks put fast-scaling AI "Supernovas" near 25% gross margin (some negative); ICONIQ's data shows AI-native product gross margins of 41% (2024), 45% (2025), and 52% in its January 2026 snapshot — improving, but far below the SaaS baseline. Scaling-stage AI companies spend roughly 23% of revenue on inference alone.
- Even the labs feel it. Outside estimates put OpenAI's and Anthropic's gross margins at ~50–60% (mid-2025), and some reports describe margin forecasts being missed by ten points or more. These are all estimates of private companies, but they sit uniformly far below software norms.
- Flat pricing plus power users is a loss machine. Anthropic disclosed that under Claude Code's original $200/month plan, some individual users were costing it tens of thousands of dollars per month. Replit's gross margin swung from ~10% to ~36% and back to ~23% within nine months as pricing and usage shifted. Cursor's June 2025 move from 500 fast requests to ~$20 of included usage at API rates triggered a public backlash and an apology — the right economics, communicated badly.
- Incumbents are repricing too. Canva raised Teams prices by up to 300% citing AI features. Microsoft holds 365 Copilot at $30/user/month and added an $18–21 Business tier. GitHub Copilot moved from $0.04 premium requests to token-denominated AI Credits ($0.01/credit, effective June 1, 2026) — with some agentic power users reporting bills 10–50x higher, which is the meter doing exactly what it was installed to do.
Agentic AI sharpens all of this. When agents do the work, seat counts decouple from value delivered and from cost incurred. A single agent run can consume thousands of model calls on your COGS line whether the customer has five logins or five hundred. Marc Benioff's framing is the industry's: per-user products are for humans; consumption products are for agents.
Move to a hybrid structure before the market moves you. A committed platform base (predictability for the buyer, revenue floor for you) plus a usage or outcome meter for AI work, with annual commitments and tiered overages as shock absorbers. Reprice toward outcomes only where the outcome is definable, attributable, and disputable in your favor.
Treat 2026 renewals as the test. Bessemer flags a renewal cliff as 2025's AI pilots hit their first renewals with real usage data on both sides of the table. Walk in knowing your per-customer AI gross margin, or your customer will set the price.
Learn from Cursor: sequencing and communication are half the work. Grandfather generously, publish the meter, give customers dashboards before you give them bills.
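A minimal sketch of how a hybrid invoice of the shape recommended above could be computed, assuming a flat committed platform fee, an included usage allowance, and graduated overage tiers. Every number in the example plan, and the names `OverageTier`, `HybridPlan`, and `monthly_invoice`, are hypothetical illustrations rather than any vendor's actual pricing.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class OverageTier:
    up_to: int | None   # cumulative overage units covered by this tier; None = unbounded
    unit_price: float   # hypothetical price per unit inside this tier


@dataclass(frozen=True)
class HybridPlan:
    platform_fee: float                     # committed monthly base (the revenue floor)
    included_units: int                     # usage covered by the base
    overage_tiers: tuple[OverageTier, ...]  # last tier should be unbounded


def monthly_invoice(plan: HybridPlan, units_used: int) -> float:
    """Committed base plus graduated overage on units beyond the allowance."""
    overage = max(0, units_used - plan.included_units)
    total = plan.platform_fee
    billed = 0  # overage units already priced by earlier tiers
    for tier in plan.overage_tiers:
        if billed >= overage:
            break
        ceiling = overage if tier.up_to is None else min(overage, tier.up_to)
        total += (ceiling - billed) * tier.unit_price
        billed = ceiling
    return round(total, 2)


# Hypothetical plan: $2,000 base, 1,000 units included,
# first 4,000 overage units at $0.99, everything beyond at $0.79.
plan = HybridPlan(
    platform_fee=2000.0,
    included_units=1000,
    overage_tiers=(
        OverageTier(up_to=4000, unit_price=0.99),
        OverageTier(up_to=None, unit_price=0.79),
    ),
)

print(monthly_invoice(plan, 800))   # under the allowance: base fee only
print(monthly_invoice(plan, 6500))  # 5,500 overage units spread across both tiers
```

The buyer's worst-case bill is easy to quote from a structure like this, which is the predictability half of the hybrid bargain; the declining tier price is one way to implement the tiered overages mentioned above.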
Part II — What intelligence costs: the full-stack pricing analysis
Cloud APIs: deflation below, inflation at the top
The defining tension of mid-2026 pricing: the cost of constant-capability intelligence keeps collapsing, while frontier list prices are rising.
On the deflation side: a16z's "LLMflation" analysis found inference for constant-quality output declining ~10x per year; Epoch AI puts the median decline at ~50x per year depending on capability threshold; Stanford's AI Index documented a 280-fold fall in GPT-3.5-class query costs in 18 months. DeepSeek made its 75% V4-Pro price cut permanent in May 2026 — $0.435 per million input tokens, $0.87 output, with cache hits at fractions of a cent.
On the inflation side: GPT-5 launched in August 2025 at $1.25/$10 per million tokens (input/output); GPT-5.5 now lists at $5/$30 — roughly 4x flagship input-price inflation in ten months. Google's Gemini 3.5 Flash costs 3x the Gemini 3 Flash it replaced. Anthropic cut Opus pricing 3x in November 2025 ($15/$75 → $5/$25), then introduced a premium tier above it: Claude Fable 5 at $10/$50. Deflation now comes from older, smaller, and open-weight models — not from flagships.
| Provider · Model | $/M input | $/M output | Notes |
|---|---|---|---|
| Anthropic · Claude Fable 5 | $10.00 | $50.00 | Premium tier; cache reads $1.00 |
| OpenAI · GPT-5.5 | $5.00 | $30.00 | Long-context (>272K) $10/$45; cached $0.50 |
| Anthropic · Claude Opus 4.8 | $5.00 | $25.00 | Fast mode 2x |
| Anthropic · Claude Sonnet 4.6 | $3.00 | $15.00 | Cache reads $0.30 |
| OpenAI · GPT-5.4 | $2.50 | $15.00 | Mini $0.75/$4.50 · Nano $0.20/$1.25 |
| Google · Gemini 3.1 Pro | $2.00 | $12.00 | >200K context: $4/$18 |
| Google · Gemini 3.5 Flash | $1.50 | $9.00 | 3x the Gemini 3 Flash it replaced |
| Mistral · Medium 3.5 | $1.50 | $7.50 | Large 3: $0.50/$1.50 |
| Anthropic · Claude Haiku 4.5 | $1.00 | $5.00 | Cache reads $0.10 |
| DeepSeek · V4-Pro (first-party) | $0.435 | $0.87 | Cache hits $0.0036; 75% cut made permanent May 2026 |
| Google · Gemini 3.1 Flash-Lite | $0.25 | $1.50 | |
| OpenAI · gpt-oss-120B (via Groq/Together/Fireworks) | $0.15 | $0.60 | Open-weight |
| DeepSeek · V4-Flash (first-party) | $0.14 | $0.28 | |
| Meta · Llama 4 Scout (via Groq/DeepInfra) | $0.10–0.11 | $0.30–0.34 | Open-weight |
| Alibaba · Qwen3-235B (via DeepInfra/Together) | $0.09–0.20 | $0.10–0.60 | Open-weight |
Standard-tier list prices per 1M tokens, fetched from provider pricing pages June 10, 2026. Open-weight prices vary by host — DeepSeek V4-Pro costs 3–5x more via Western hosts (Together $2.10/$4.40; Fireworks $1.74/$3.48; DeepInfra $1.30/$2.60) than first-party. Always specify the host when quoting an "open-source model price."
Four pricing mechanics now matter as much as the headline rates, because they are where your COGS is actually won or lost:
- Prompt caching: cache reads cost ~10% of list input at the frontier labs (OpenAI, Anthropic, Google) and under 1% at DeepSeek ($0.0036 against $0.435 list). For agentic workloads that re-send long system prompts and tool definitions on every call, caching is routinely the single largest cost lever.
- Batch processing: a 50% discount is now universal (OpenAI, Anthropic, Google, Groq, Fireworks). Anything not latency-sensitive — evals, enrichment, classification backfills — should never run at standard rates.
- Premium surcharges are proliferating: priority/latency tiers at 1.5–2.5x, long-context surcharges at all three major labs, and ~10% uplifts for data residency. Your blended $/token can drift far above list without anyone deciding it should.
- Reasoning models inflate output token volume: thinking tokens are billed as output, so effective cost per visible token on reasoning workloads is materially higher than list price implies.
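To make the four mechanics concrete, here is a rough per-request cost calculator. It is a sketch under stated assumptions: cache reads billed at 10% of list input, a 50% batch discount, reasoning tokens billed at the output rate, and the caching and batch discounts stacking multiplicatively (check your provider's terms before relying on that). The function name and the example token counts are hypothetical; the $3/$15 rates are the Claude Sonnet 4.6 list prices from the table above.

```python
def request_cost_usd(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    cache_hit_rate: float = 0.0,          # share of input tokens served from cache
    cache_read_multiplier: float = 0.10,  # cache reads at ~10% of list input
    batch: bool = False,
    batch_discount: float = 0.50,         # assumed to stack with caching
) -> float:
    """Effective cost of one model call after caching, batch, and reasoning effects."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * cache_read_multiplier) * input_price_per_m / 1e6
    # Reasoning ("thinking") tokens are billed as output even though the user never sees them.
    output_cost = (visible_output_tokens + reasoning_tokens) * output_price_per_m / 1e6
    total = input_cost + output_cost
    return total * (1 - batch_discount) if batch else total


# Hypothetical agent call: long re-sent prompt, short answer, heavier hidden reasoning.
call = dict(
    input_tokens=20_000,
    visible_output_tokens=500,
    reasoning_tokens=1_500,
    input_price_per_m=3.00,
    output_price_per_m=15.00,
)

no_cache = request_cost_usd(**call)
with_cache = request_cost_usd(**call, cache_hit_rate=0.9)
with_cache_and_batch = request_cost_usd(**call, cache_hit_rate=0.9, batch=True)

print(f"list: ${no_cache:.4f}  cached: ${with_cache:.4f}  cached+batch: ${with_cache_and_batch:.4f}")
```

In this example a 90% cache hit rate cuts the call's cost by more than half, and the 1,500 reasoning tokens behind 500 visible ones mean output is effectively billed at four times list per visible token.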
On-prem vs cloud: own the baseline, rent the peaks
The build-vs-buy question has three layers: buy tokens (APIs), rent GPUs (cloud), or own GPUs (on-prem/colo). The economics in mid-2026:
| Layer | Representative cost (June 2026) | Notes |
|---|---|---|
| H100 purchase | $25–40K/GPU; ~$250K per 8-GPU server | Used SXM5 trades at $12–22K; no official list price |
| H200 / B200 purchase | $30–40K / $30–50K per GPU | 8x B200 server ~$338K (Lenovo published) |
| GB200 NVL72 rack | ~$3M per 72-GPU rack | Vera Rubin racks quoted $5–8.8M — rack prices rising 2–3x per generation |
| H100 rental — neocloud | $2.46–3.99/GPU-hr on-demand | Lambda $3.99 · RunPod $3.29 · Nebius $3.85 ($2.15 preemptible) · spot floors ~$1.03–1.49 |
| H100 rental — hyperscaler | $6.88–12.29/GPU-hr on-demand | 2–3x neocloud; AWS cut P5 prices 44% in June 2025 under neocloud pressure |
| B200 rental | $5.89–8.60/GPU-hr | RunPod $5.89 · Lambda $6.69 · CoreWeave $8.60 |
| Power & cooling | 8x H100 ≈ $10.5K/yr electricity; GB200 rack ≈ $105K/yr | At $0.10–0.12/kWh; cooling adds 25–40%; liquid cooling cuts PUE ~1.5 → ~1.1 |
| Ops staffing | ~0.5 FTE ≈ $225–300K per 3 years | Plus 10–20 eng-hours/month maintenance per serious deployment |
GPU purchase figures are street estimates — NVIDIA publishes no list prices and ranges across sources span 30–50%. Rental rates fetched from live pricing pages June 10, 2026.
The decline in rental prices is real but no longer monotonic: H100 rates collapsed ~64–75% from their 2023 peaks ($8+/hr to $2–4/hr), but SemiAnalysis's one-year contract index rose ~40% between October 2025 and March 2026 ($1.70 → $2.35/hr) as inference demand absorbed surplus capacity. Plan on cyclical, not perpetually falling, GPU prices.
Self-hosting open-weight models: the math. Production-grade serving stacks are now very good — vLLM sustains ~2,200 output tokens/sec per H200 for DeepSeek-class MoE models in multi-node deployments; MLPerf shows a Llama-70B-class model producing ~30,500 tokens/sec on an 8x H100 node. Run the arithmetic and a fully utilized owned 8x H100 server produces output tokens at roughly $0.11 per million, and a rented H200 node serving DeepSeek lands near $0.57 per million at full utilization — against $15–50 per million output for frontier APIs. The raw spread is enormous. Three corrections shrink it:
- Utilization is everything. Halve utilization and your $/token doubles; most enterprise inference loads are bursty and diurnal. The rule of thumb: owning beats hyperscaler on-demand rental above roughly 20% sustained utilization (Lenovo's published math: ~4.3 hours/day over five years), but against neocloud or reserved pricing the threshold doubles or triples.
- Hidden costs run 3–5x the naive spreadsheet. Engineering labor, model upgrades every few months, evaluation, failover, and utilization waste. Budget them or your "cheap" cluster becomes an expensive hobby.
- Your competitor is the budget API, not the flagship. Against DeepSeek first-party at $0.435/$0.87 or gpt-oss-120B at $0.15/$0.60, self-hosting only wins at sustained volumes of 50–100M+ tokens/month — below ~500K tokens/day, pay-per-token is almost always cheaper.
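The self-hosting arithmetic above can be sketched as two small functions. The inputs marked as assumptions are not from the sources: a three-year straight-line amortization, a 30% cooling overhead (inside the 25–40% range quoted earlier), no staffing cost, and a hypothetical $4.50/GPU-hr H200 rental rate. The throughput, capex, and electricity figures are the ones cited in this section.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def owned_cost_per_million(
    capex_usd: float,
    amortization_years: float,  # assumption: straight-line amortization
    annual_power_usd: float,
    cooling_overhead: float,    # assumption: fraction added on top of power
    tokens_per_sec: float,
    utilization: float,         # fraction of the year spent serving at full throughput
) -> float:
    """Hardware-plus-power cost per million output tokens for an owned server."""
    annual_cost = capex_usd / amortization_years + annual_power_usd * (1 + cooling_overhead)
    annual_tokens_m = tokens_per_sec * SECONDS_PER_YEAR * utilization / 1e6
    return annual_cost / annual_tokens_m


def rented_cost_per_million(hourly_rate_usd: float, tokens_per_sec: float, utilization: float) -> float:
    """Rental cost per million output tokens for one GPU."""
    tokens_m_per_hour = tokens_per_sec * 3600 * utilization / 1e6
    return hourly_rate_usd / tokens_m_per_hour


# Owned 8x H100 server: ~$250K, ~$10.5K/yr electricity, ~30,500 tokens/sec (figures from the text).
owned_full = owned_cost_per_million(250_000, 3, 10_500, 0.30, 30_500, utilization=1.0)
owned_half = owned_cost_per_million(250_000, 3, 10_500, 0.30, 30_500, utilization=0.5)

# Rented H200 at ~2,200 tokens/sec; the $4.50/GPU-hr rate is a hypothetical input.
rented_full = rented_cost_per_million(4.50, 2_200, utilization=1.0)

print(f"owned, 100% utilized:  ${owned_full:.3f} per million tokens")
print(f"owned, 50% utilized:   ${owned_half:.3f} per million tokens")
print(f"rented, 100% utilized: ${rented_full:.3f} per million tokens")
```

Under these assumptions the owned server lands in the same neighborhood as the figure quoted above and the rented node close to the $0.57 figure, and halving utilization exactly doubles cost per token. Adding the staffing line from the cost table, or the 3–5x hidden-cost multiple, moves the owned number up accordingly.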
Default to APIs until you have (a) sustained, predictable volume, (b) a quality-acceptable open-weight model for the task, and (c) the engineering bench to run it. Rent before you buy: prove utilization on neocloud reserved capacity first — it prices at $2–4/GPU-hr and converts your model to opex. Buy for the baseline, burst to cloud: own capacity sized to your trough load (where utilization stays high), rent or call APIs for peaks. Buy on-prem outright only for data-sovereignty requirements or token volumes where the 3–5x hidden-cost multiple still leaves a clear win over budget APIs. Revisit annually: both API prices and GPU economics are moving too fast for a three-year-old decision to stay correct.
The rest of the stack: cheap, and converging on the same pattern
Every layer around the model has settled into the same shape: open-source core, low flat platform fee, usage metering on top.
- Vector databases: Pinecone runs free → $20 → $50-minimum → $500-minimum tiers; Weaviate Cloud from $45/month priced per million vector dimensions; Zilliz from ~$5–63 per million vectors depending on performance tier; pgvector on a ~$100/month Postgres instance remains the budget default for workloads under tens of millions of vectors.
- LLM observability and evals: LangSmith at $39/seat plus $2.50 per 1,000 traces; Langfuse (open-source core) $29–199/month tiers; Braintrust $249/month; Helicone $79/month. Single-digit percent of inference spend, and worth every basis point — you cannot optimize what you cannot trace.
- Gateways and routing: OpenRouter passes through provider prices and monetizes via a ~5.5% credit fee; Cloudflare's AI Gateway core features are free; LiteLLM is open source; Portkey from $49/month. The gateway is the cheapest insurance in the stack: one integration point for caching, failover, routing, and per-customer spend caps.
- Managed inference: the same H100 costs $3.95/hr on Modal, $5.49 on Replicate, $6.50 on Baseten — a 65% spread for identical silicon, priced by convenience and cold-start engineering. vLLM, SGLang, and TensorRT-LLM are free; their cost is the GPU underneath plus your ops time.
- Embeddings are a rounding error: $0.02–0.13 per million tokens (OpenAI, Voyage, Cohere). Embedding a million-token corpus costs between two and thirteen cents.
The structural takeaway: model inference is the only layer with real COGS gravity — every other layer is one to three orders of magnitude cheaper. But note the survey finding from Mavvrik's cost-governance research: "data platform usage" (56%) and "networking/egress" (52%) were cited as unexpected AI costs more often than LLM tokens (37%), and 84% of companies reported AI delivery costs cutting product gross margins by more than six points. The model bill is the biggest line; it is rarely the surprising one.
The unit of software value is no longer the person with a login. It is the work the machine did — and the strategic question is whether you can meter it to finance-grade.
Part III — Finance-grade usage data without ripping out billing
The market just told you metering matters
In roughly two years, the three biggest distribution owners in monetization each bought a metering engine. Stripe completed its acquisition of Metronome, which powers billing for OpenAI, Anthropic, and NVIDIA, in January 2026, at a reported ~$1B. Salesforce signed to acquire m3ter in June 2026 to bring metering and rating natively into Agentforce Revenue Management. Zuora bought Togai in 2024 and was itself taken private at $1.7B. Usage metering has gone from venture category to platform feature. The remaining independents, Orb (Vercel, Pinecone, Perplexity, Replit), open-source Lago (Mistral, Together), and Amberflo, compete on flexibility; Stripe Billing's own usage product takes 0.7% of billing volume.
The strategic read: you will not have to build metering infrastructure from scratch, and you should not. What you do have to build is the discipline around it.
The pattern: a metering layer in front of what you already run
Nobody successful is doing rip-and-replace. The architecture that works is a mediation layer: meter where the work happens, rate and aggregate in a dedicated system, and feed rated results into the billing, CRM, and ERP systems you already own. Salesforce's own framing of the m3ter acquisition is exactly this — metering data flows across CRM, ERP, and quote-to-cash, not a new billing system. Confluent replaced a homegrown billing stack with a metering platform and eliminated weeks of manual work per pricing change; Ideogram stood up dynamic AI pricing in under a month with no internal billing team.
Pragmatically, finance-grade usage data means five properties, and they are requirements, not aspirations:
- One canonical event stream. Every monetizable action — tokens, agent runs, resolutions — emitted once, with customer ID, feature, model, and timestamp. Product analytics can sample; billing data cannot.
- Idempotent, immutable, replayable events, so a pipeline bug is a reprocessing job rather than a credit-memo apology tour.
- Rating separated from metering, with versioned, effective-dated price rules — so pricing can change quarterly without breaking the close or rewriting history.
- Cost joined to revenue at the same grain. The margin question — which customer, on which feature, at what model cost — must be answerable from one table. This is what the usage ↔ billing ↔ payments ↔ GL reconciliation layer exists to do.
- Entitlements as a service, not as if-statements. Plan limits, feature gates, and token budgets belong in a dedicated layer (Stigg, Schematic, or built equivalents) that product and commercial teams can change without an engineering release. This is also your kill switch when a customer's agent goes runaway.
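The first three properties can be illustrated with a small in-memory sketch: an immutable usage event carrying an idempotency key, a log that ignores duplicate deliveries and can be replayed, and a rating step that picks the price rule in effect at the event's timestamp. All names here (`UsageEvent`, `PriceRule`, `EventLog`, `rate`) and all prices are hypothetical; a production system would use a durable store and a real metering platform rather than a Python dict.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: events are immutable once emitted
class UsageEvent:
    event_id: str        # idempotency key, generated where the work happens
    customer_id: str
    feature: str
    model: str
    quantity: int
    timestamp: datetime


@dataclass(frozen=True)
class PriceRule:
    feature: str
    unit_price: float
    effective_from: datetime  # rules are versioned by effective date, never edited in place


class EventLog:
    """Append-only log that drops duplicate deliveries and supports replay."""

    def __init__(self) -> None:
        self._events: dict[str, UsageEvent] = {}

    def ingest(self, event: UsageEvent) -> bool:
        if event.event_id in self._events:
            return False  # retry or duplicate: already recorded, do not double-bill
        self._events[event.event_id] = event
        return True

    def replay(self) -> list[UsageEvent]:
        return list(self._events.values())


def rate(event: UsageEvent, rules: list[PriceRule]) -> float:
    """Price an event with the latest rule effective at the event's timestamp."""
    applicable = [r for r in rules
                  if r.feature == event.feature and r.effective_from <= event.timestamp]
    if not applicable:
        raise LookupError(f"no price rule for {event.feature} at {event.timestamp}")
    rule = max(applicable, key=lambda r: r.effective_from)
    return event.quantity * rule.unit_price


utc = timezone.utc
rules = [
    PriceRule("resolution", 0.99, datetime(2026, 1, 1, tzinfo=utc)),
    PriceRule("resolution", 1.10, datetime(2026, 4, 1, tzinfo=utc)),  # later price change
]
log = EventLog()
march = UsageEvent("evt-1", "acme", "resolution", "model-a", 10, datetime(2026, 3, 15, tzinfo=utc))
april = UsageEvent("evt-2", "acme", "resolution", "model-a", 10, datetime(2026, 4, 2, tzinfo=utc))

first = log.ingest(march)
duplicate = log.ingest(march)  # same event delivered twice
log.ingest(april)

total = sum(rate(e, rules) for e in log.replay())
print(first, duplicate, round(total, 2))
```

Because rating is a pure function of the stored events and the dated rules, a pricing change in April leaves March's charges untouched, and a pipeline bug can be fixed by replaying the log rather than by issuing credit memos.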
Revenue recognition: credits are a liability until consumed
AI pricing has made prepaid credits the dominant enterprise structure, and credits have real ASC 606 mechanics. The practitioner consensus (Leapfin, Ordway, and peers — notably, the Big 4 have not yet published AI-token-specific guidance, so vendor-led practice is the current state of the art):
- Prepaid credits are a contract liability at sale; revenue is recognized as credits are consumed — which makes your usage pipeline a revenue-recognition system whether finance has noticed or not.
- Promotional credits need a material-right assessment: if they convey a material right, allocate transaction price; if they're pure marketing, expense them.
- Breakage on expiring credits is recognized proportionally to usage if reasonably estimable, otherwise only when redemption becomes remote. (Sell $100K of credits expecting 8% expiry: the expected $8K of breakage is recognized in proportion to consumption of the other $92K, so half of it is booked once $46K of credits have been used.)
- Drawdown order must be deterministic and documented — e.g., promotional → prepaid → pay-as-you-go — because auditors will ask, and "it depends which microservice answered" is not an accounting policy.
- Platform access is typically a stand-ready obligation recognized ratably, separate from the consumption element — most AI contracts are therefore multi-element by default.
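Two of the mechanics above lend themselves to a short sketch: a deterministic drawdown order, and proportional breakage recognition on prepaid credits. This is an illustration of the arithmetic only, assuming the breakage estimate stays fixed over the contract; the function names are hypothetical, and the actual policy is whatever you agree with your auditors.

```python
from __future__ import annotations

# Documented, fixed order: promotional credits first, then prepaid.
# Anything left over is billed as pay-as-you-go.
DRAWDOWN_ORDER = ("promotional", "prepaid")


def draw_down(balances: dict[str, float], usage_usd: float) -> dict[str, float]:
    """Split one period's usage across credit buckets in the documented order."""
    remaining = usage_usd
    applied: dict[str, float] = {}
    for bucket in DRAWDOWN_ORDER:
        take = min(balances.get(bucket, 0.0), remaining)
        applied[bucket] = take
        remaining -= take
    applied["pay_as_you_go"] = remaining
    return applied


def recognized_to_date(credits_sold: float, expected_breakage_rate: float,
                       credits_consumed: float) -> float:
    """Revenue recognized so far, with breakage booked in proportion to redemptions."""
    expected_redemptions = credits_sold * (1 - expected_breakage_rate)
    share = min(credits_consumed / expected_redemptions, 1.0)
    return credits_sold * share


split = draw_down({"promotional": 500.0, "prepaid": 2000.0}, usage_usd=3000.0)
print(split)

# The example from the text: $100K of credits sold, 8% expected to expire, $46K consumed so far.
revenue = recognized_to_date(100_000, 0.08, 46_000)
contract_liability = 100_000 - revenue
print(round(revenue, 2), round(contract_liability, 2))
```

With $46K consumed, half of the expected redemptions have occurred, so half of the $100K sale is recognized and the rest remains a contract liability.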
Margin protection: an engineering discipline with a finance dashboard
The cost side has a now-standard toolkit, and the savings are large enough to be strategy rather than housekeeping:
- Model routing: RouteLLM (UC Berkeley/LMSYS, ICLR 2025) showed trained routers cutting costs 35–85% while retaining 95% of frontier-model quality. Route easy traffic to cheap models; reserve the flagship for queries that earn it.
- Prompt caching: 90% off cached input at the frontier labs (Anthropic cache reads at 0.1x; OpenAI cached input at 10% of list). For agent workloads, cache hit rate is a first-class engineering KPI.
- Batch everything that can wait: the universal 50% discount.
- Per-customer token budgets and spend alerts, enforced at the gateway or billing layer, with automated actions at thresholds — the difference between an unprofitable customer and an unprofitable quarter.
- FinOps for AI is now formalized: the FinOps Foundation publishes working-group guidance (model right-sizing, training-vs-inference tagging taxonomies, token-level showback) and launched a FinOps for AI certification in 2025. Adopt the vocabulary; it gives engineering and finance a shared language.
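As one illustration of per-customer budgets with automated actions at thresholds, the sketch below maps month-to-date spend to an action a gateway could enforce. The threshold values, the action names, and the `BudgetPolicy` and `budget_action` identifiers are all hypothetical choices for this example, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetPolicy:
    monthly_budget_usd: float
    alert_at: float = 0.8      # notify customer and account team (hypothetical threshold)
    downgrade_at: float = 1.0  # route traffic to a cheaper model
    block_at: float = 1.5      # hard stop: the kill switch for a runaway agent


def budget_action(policy: BudgetPolicy, spend_to_date_usd: float) -> str:
    """Return the action to apply for a customer's month-to-date AI spend."""
    ratio = spend_to_date_usd / policy.monthly_budget_usd
    if ratio >= policy.block_at:
        return "block"
    if ratio >= policy.downgrade_at:
        return "route_to_cheaper_model"
    if ratio >= policy.alert_at:
        return "alert"
    return "allow"


policy = BudgetPolicy(monthly_budget_usd=1_000.0)
for spend in (500.0, 850.0, 1_200.0, 1_600.0):
    print(spend, budget_action(policy, spend))
```

Checking the most severe threshold first keeps the function correct even if a policy sets two thresholds to the same value.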
Days 1–30: Instrument the canonical usage event stream and stand up cost observability (gateway + tracing). Produce the first per-customer, per-feature AI margin report — however ugly.
Days 31–60: Pick the metering/rating layer (buy, not build) and run it shadow-mode against current invoices. Move entitlements out of code. Agree the ASC 606 credit policy with your auditors before the next pricing change ships.
Days 61–90: Turn on margin protection — routing, caching, budgets, alerts. Launch the repriced offer to a cohort, grandfathered and dashboard-first. Review the margin report with product, engineering, and finance in the same room, monthly, forever.
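The per-customer, per-feature margin report from the first thirty days does not need to be sophisticated. A minimal sketch, assuming each rated event already carries both its billed revenue and its attributed model cost (the field names, the sample data, and the 40% margin floor are hypothetical):

```python
from collections import defaultdict


def margin_report(rated_events, min_gross_margin):
    """Aggregate revenue and model cost per (customer, feature) and flag thin margins."""
    totals = defaultdict(lambda: {"revenue": 0.0, "cost": 0.0})
    for event in rated_events:
        key = (event["customer_id"], event["feature"])
        totals[key]["revenue"] += event["revenue_usd"]
        totals[key]["cost"] += event["model_cost_usd"]

    rows = []
    for (customer_id, feature), t in sorted(totals.items()):
        revenue, cost = t["revenue"], t["cost"]
        # Cost with no revenue is treated as the worst possible margin.
        margin = (revenue - cost) / revenue if revenue > 0 else float("-inf")
        rows.append({
            "customer_id": customer_id,
            "feature": feature,
            "revenue_usd": revenue,
            "cost_usd": cost,
            "gross_margin": margin,
            "below_floor": margin < min_gross_margin,
        })
    return rows


events = [
    {"customer_id": "acme", "feature": "support_agent", "revenue_usd": 100.0, "model_cost_usd": 30.0},
    {"customer_id": "acme", "feature": "support_agent", "revenue_usd": 50.0, "model_cost_usd": 45.0},
    {"customer_id": "globex", "feature": "coding_assistant", "revenue_usd": 200.0, "model_cost_usd": 180.0},
]

report = margin_report(events, min_gross_margin=0.40)
for row in report:
    print(row["customer_id"], row["feature"], f"{row['gross_margin']:.0%}", row["below_floor"])
```

The point of the exercise is the grain: once revenue and cost share a key, the rows flagged `below_floor` are the repricing conversations.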
Part IV — The operating model: Product, Tech & Finance as one pricing team
The deeper change AI forces is organizational. Pricing used to be an annual marketing exercise. It is now a continuous, cross-functional control loop — the top 500 software companies with public pricing made 1,800+ pricing changes in 2025, roughly 3.6 each. A loop that fast cannot run through committees. It runs through clear ownership:
01 Product owns the unit of value.
Product's job is to define monetizable use cases: which AI work maps to which billable unit — a resolution, an action, a task, a token bundle, a seat that happens to include AI. The test for each candidate unit: the customer can predict it, finance can recognize it, engineering can meter it, and a renewal conversation can defend it. Hybrid is the default answer (Bessemer calls it the effective middle ground); outcome units are earned one use case at a time, starting where attribution is clean. Product also owns the packaging ladder — which models, what context, how much priority each tier buys — because in AI products, the model menu is the packaging.
02 Technology owns the meter and the margin.
Engineering's monetization deliverables are: the canonical event stream; entitlements enforcement at runtime; the gateway layer (routing, caching, failover, budgets); and cost attribution at the same grain as revenue. The infrastructure decisions — API vs open-weight, rent vs own, which providers — are now pricing decisions, made jointly with finance against the break-even math in Part II. The engineering culture shift: tokens are COGS, cache hit rate is a margin metric, and shipping an AI feature without a meter is shipping an unpriced liability.
03 Finance owns truth and the guardrails.
Finance turns the meter into money: revenue-grade reconciliation from usage to GL, the credit/breakage policy agreed with auditors, forecasting models for consumption revenue (harder than seats — variable consideration, commit-and-drawdown dynamics, usage seasonality), and the per-customer margin dashboard that triggers repricing conversations. Finance also sets the guardrails product and engineering operate inside: minimum gross margin per offer, maximum unhedged exposure to a single model provider, spend thresholds where automated protection kicks in.
The cadence that binds the three: a monthly monetization review — usage trends, margin by customer and feature, cache and routing performance, pricing experiment results — and a quarterly pricing release, treated with the same discipline as a product release: versioned, tested, communicated, reversible.
What we'll grant the other side
We believe what we just wrote. We will also push back on ourselves.
The seat may outlive its obituary. Bain found no established vendor that fully abandoned seat pricing, and roughly 35% simply raised seat prices with AI bundled in. If your AI feature makes each human meaningfully more productive and your costs per user are bounded, a higher seat price is simpler, more predictable, and easier to sell than any meter. Complexity is a real price customers pay; do not spend it without need.
The cost crunch may ease. Constant-capability token prices fall 10–50x per year, and for many products, "good enough" intelligence keeps getting cheaper faster than usage grows. Margins are already recovering — 41% to 52% in two years on ICONIQ's data. The bet against this: agentic workloads are growing token consumption per task faster than unit prices fall at the frontier, and competitive pressure pushes everyone toward the newest, priciest models. We think the crunch persists for frontier-dependent products and eases for everyone else — which is itself an argument for the routing-and-hybrid strategy in this paper.
Outcome pricing can misalign as easily as align. A vendor paid per resolution is incentivized to mark conversations resolved; attribution disputes consume the trust the model was meant to build; and customers may rationally prefer a predictable bill over a fair one. The 72-hour confirmation windows and no-charge escalation rules are honest engineering around real failure modes — they are also evidence of how much scaffolding outcome pricing needs.
And metering can hurt. Cursor's backlash and GitHub Copilot users' 10–50x bills show the meter transferring risk to the customer. The companies that win with usage pricing spend as much on predictability — caps, alerts, included allowances, dashboards — as on the meter itself.
Why this matters now
Every actor in the system is repricing at once: the labs are raising flagship prices while open-weight models race to the floor; the platforms have been buying metering engines, one of them for a reported ~$1B; your customers' procurement teams are walking into 2026 renewals with usage dashboards of their own. The companies that thrive will be those that can see their cost and value at the same grain and change price as routinely as they change code. The window in which "we're still figuring out AI pricing" is an acceptable answer is closing.
What we're committing to
The AI Impact Foundation will keep this analysis current — pricing tables re-verified against primary sources, the on-prem break-even math updated as GPU and API prices move — and will teach monetization in our courses the way this paper argues it should be practiced: as a cross-functional discipline with finance-grade data underneath it, not a slide in a board deck.
That's the work. Come help us do it.
Sources
- Kyle Poyar, The State of B2B Monetization in 2025, Growth Unhinged (survey n=240+, June 2025)
- Bain & Company, Per-Seat Software Pricing Isn't Dead, but New Models Are Gaining Steam (October 2025)
- Bessemer Venture Partners, The AI Pricing and Monetization Playbook (February 2026) and The State of AI 2025
- Metronome, State of Usage-Based Pricing 2025 (survey n=100, January 2025)
- ICONIQ, State of AI — 2026 Bi-Annual Snapshot (~300 software executives)
- a16z, LLMflation: LLM Inference Cost Is Falling Fast (November 2024); Epoch AI, LLM Inference Price Trends; Stanford HAI, AI Index Report 2025
- Tanay Jaipuria, The State of AI Gross Margins in 2025 (September 2025)
- Provider pricing pages, fetched June 10, 2026: OpenAI, Anthropic, Google Gemini, DeepSeek, Mistral, Together AI, Fireworks, Groq, DeepInfra
- GPU economics: Lenovo Press, On-Premise vs Cloud: Generative AI TCO, 2026 Edition; SemiAnalysis, GPU Cloud Economics Explained and H100 rental price index; live pricing from Lambda, RunPod, Nebius, CoreWeave (June 10, 2026)
- Serving benchmarks: vLLM, Large-Scale Serving: DeepSeek at 2.2K tok/s per H200 (December 2025)
- Outcome pricing: Intercom Fin; Zendesk outcome-based pricing (March 2025); Sierra, Outcome-Based Pricing for AI Agents; Salesforce Agentforce Flex Credits
- Repricing cases: Cursor, Clarifying Our Pricing (June–July 2025); GitHub Copilot premium requests / AI Credits; Fortune on Canva Teams pricing; Microsoft 365 Copilot pricing
- Billing consolidation: Stripe completes Metronome acquisition (January 2026); Salesforce to acquire m3ter (June 2026); Zuora / Silver Lake & GIC
- Revenue recognition: Leapfin, The AI Rev Rec Playbook for Usage-Based Pricing (November 2025); Ordway, ASC 606 for Usage-Based Pricing; PwC Viewpoint, Identifying Performance Obligations
- Margin protection: LMSYS, RouteLLM (ICLR 2025); Anthropic prompt caching docs; FinOps Foundation, FinOps for AI; Mavvrik AI Cost Governance Report
- Stack pricing pages, fetched June 10, 2026: Pinecone, Weaviate, Zilliz, LangSmith, Langfuse, Braintrust, Helicone, OpenRouter, Modal, Replicate, Baseten, Voyage AI
Notes on confidence: GPU purchase prices are street estimates (no official list prices exist); OpenAI/Anthropic gross-margin figures are outside estimates of private companies from paywalled reporting and are presented as ranges; the Stripe–Metronome price (~$1B) is reported, not company-confirmed. All API and stack prices were fetched from primary vendor pages on June 10, 2026 and will drift — quickly.