Insights · Issue 03 · June 2026

State of AI monetization

The cost crunch is real, the seat is dying slowly, and the meter is becoming the contract. How product, technology, and finance leaders should price the machine's work — across cloud APIs, open-source models, and the racks in between.

Read time: 25 min · Authors: AI Impact Foundation · Topic: Pricing, AI economics, monetization

For twenty years, software pricing had one unit: the person with a login. AI broke that unit. The work is now done partly by machines whose costs arrive by the token, whose customers consume by the agent, and whose margins behave nothing like SaaS. Three in four software companies changed their pricing in the past year. Most are improvising.

This paper is about doing it deliberately.

It answers three questions. First, how the AI cost crunch and new usage expectations are reshaping software pricing, and what product and pricing leaders need to do now. Second, what intelligence actually costs in mid-2026 — frontier APIs, open-weight models, on-prem GPUs, and the rest of the stack — and when to rent versus own. Third, how to build an operating model and a tech stack for finance-grade usage data, clean revenue recognition, and margin protection without a rip-and-replace billing project. Every figure is sourced; the full list is at the end.

Part I — The market shift: from seats to work

The pricing models are moving under our feet

The survey data tells one consistent story. In Kyle Poyar's 2025 State of B2B Monetization (240+ companies, surveyed April–May 2025), pure seat-based pricing fell from 21% to 15% of companies in a single year, flat-fee subscriptions fell from 29% to 22%, and hybrid pricing — a platform fee plus a usage or outcome meter — jumped from 27% to 41%. Metronome's January 2025 survey found 85% of software companies have adopted some form of usage-based pricing, nearly half of them within the last two years. ICONIQ's State of AI survey of ~300 software executives found 37% planning a pricing change within the year, with the top driver being customer demand for consumption- and outcome-based pricing (46%), followed by demand for predictability (40%) and competitive pressure (39%).

Primary pricing model, share of companies | 2024 | 2025
Seat-based | 21% | 15%
Flat subscription | 29% | 22%
Hybrid (base + meter) | 27% | 41%
Year-over-year shift in primary pricing model among 240+ surveyed software companies. Source: Kyle Poyar, The State of B2B Monetization in 2025, Growth Unhinged (April–May 2025 survey).

Note what those last two drivers say: customers want consumption pricing and they want predictability. The winning structures are hybrids precisely because they hold both — a committed base the CFO can budget, a meter that scales with value, and safeguards on top (ICONIQ: 49% use annual commitments, 29% tiered overages).

Bain's October 2025 analysis of 30+ established SaaS vendors adding generative AI found the same hybrid convergence from the other direction: roughly 35% simply raised per-seat prices and bundled AI in; about 65% adopted hybrid structures layering an AI meter on top of seats; none still monetized AI purely as a separate add-on — and none had fully abandoned seats either. The seat is not dead. It is being demoted from the unit of value to a unit of access.

3 in 4 software companies changed pricing in the past year (Growth Unhinged, 2025)
85% have adopted some form of usage-based pricing (Metronome, 2025)
23% of revenue goes to inference at scaling-stage AI companies (ICONIQ, 2026)
5% → 25%: outcome-based as primary model, today vs expected by 2028 (Growth Unhinged)

Outcome pricing: the frontier, not the norm

The most discussed change is charging for results rather than usage. The flagship examples are real: Intercom's Fin charges $0.99 per resolution (now billed per "outcome" — no charge if the AI escalates to a human). Zendesk followed in March 2025 at roughly $1.50 per automated resolution, with a resolution confirming after 72 hours of ticket inactivity. Sierra built its business on outcome pricing from day one — paid on resolved conversations, saved cancellations, completed upsells. Bessemer's February 2026 playbook frames the era in one line: client-server monetized licenses, SaaS monetized access, AI will monetize outcomes.

But keep the base rate in view: in Poyar's survey only 5% of companies say outcome-based is their primary model today; 25% expect it to be by 2028. Outcome pricing requires an outcome you can define, attribute, and defend in a renewal negotiation. "Ticket resolved with no human touch and no reopen within 72 hours" clears that bar. Most AI value — drafting, summarizing, coding assistance — does not, yet.
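A minimal sketch of how such an outcome definition could be encoded as a billing predicate. The `Ticket` fields and the function name are hypothetical, and the only rule taken from the text is "no human touch and no reopen within 72 hours".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

CONFIRMATION_WINDOW = timedelta(hours=72)  # window quoted in the text


@dataclass(frozen=True)
class Ticket:
    resolved_at: Optional[datetime]          # when the AI marked it resolved
    human_touched: bool                      # True if a human agent intervened
    reopened_at: Optional[datetime] = None   # assumed to fall after resolved_at


def is_billable_resolution(ticket: Ticket, now: datetime) -> bool:
    """Bill only AI-only resolutions that survived the full window unreopened."""
    if ticket.resolved_at is None or ticket.human_touched:
        return False
    window_end = ticket.resolved_at + CONFIRMATION_WINDOW
    if now < window_end:
        return False  # still inside the window: outcome not yet confirmed
    reopened_in_window = (
        ticket.reopened_at is not None and ticket.reopened_at <= window_end
    )
    return not reopened_in_window
```

The point of writing it down this way is that every clause is a field someone can audit in a renewal dispute. An outcome that cannot be reduced to a predicate like this is probably not ready to be a billing unit.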

Watch Salesforce as the bellwether of how hard this is: Agentforce launched at $2 per conversation (October 2024), added Flex Credits at $0.10 per action (May 2025), then added per-user "digital labor" licenses from about $125/user/month — three pricing models in roughly eighteen months. That is not indecision so much as live experimentation at enterprise scale, and it is what the whole market is doing.

The cost crunch is the forcing function

Why is everyone repricing? Because AI features carry real marginal cost, and classic SaaS pricing assumed there was none.

Figure: The gross margin gap. Classic SaaS 75–85%; AI-native, 2026: 52%; AI-native, 2024: 41%; AI "Supernovas": ~25%.
Product gross margins. AI-native figures from ICONIQ's State of AI surveys (~300 software executives; 2024 reported, 2026 bi-annual snapshot); "Supernovas" — fast-scaling AI natives, some with negative margins — from Bessemer's State of AI 2025; classic SaaS range per Bessemer benchmarks.

Agentic AI sharpens all of this. When agents do the work, seat counts decouple from value delivered — and from cost incurred. An agent run consumes thousands of model calls on your COGS line whether the customer has five logins or five hundred. Marc Benioff's framing is the industry's: per-user products are for humans; consumption products are for agents.

What this means for pricing leaders — now

Move to a hybrid structure before the market moves you. A committed platform base (predictability for the buyer, revenue floor for you) plus a usage or outcome meter for AI work, with annual commitments and tiered overages as shock absorbers. Reprice toward outcomes only where the outcome is definable, attributable, and disputable in your favor.
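To make the structure concrete, here is a minimal sketch of a hybrid bill: a committed base, an included allowance, and graduated overage tiers. The plan, its prices, and all names are hypothetical and are not taken from any vendor's price list.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OverageTier:
    up_to: Optional[float]  # cumulative overage units this tier covers; None = no ceiling
    unit_price: float       # price per overage unit inside this tier


@dataclass(frozen=True)
class HybridPlan:
    base_fee: float                 # committed platform fee for the period
    included_units: float           # usage allowance bundled into the base fee
    tiers: Tuple[OverageTier, ...]  # ordered from lowest ceiling to highest


def period_charge(plan: HybridPlan, used_units: float) -> float:
    """Base fee plus graduated overage on usage beyond the included allowance."""
    overage = max(0.0, used_units - plan.included_units)
    charge = plan.base_fee
    floor = 0.0  # overage units already priced by earlier tiers
    for tier in plan.tiers:
        if overage <= floor:
            break
        ceiling = overage if tier.up_to is None else min(overage, tier.up_to)
        charge += (ceiling - floor) * tier.unit_price
        floor = ceiling
    return charge


# Hypothetical plan: $5,000 base, 10,000 actions included,
# first 5,000 overage actions at $0.10, everything beyond at $0.08.
plan = HybridPlan(5000.0, 10_000, (OverageTier(5_000, 0.10), OverageTier(None, 0.08)))
```

Under these assumed numbers, a customer at 8,000 actions pays only the base, while one at 18,000 pays the base plus $500 and $240 of overage. The tiers here are graduated (each band priced at its own rate); a volume-style scheme that reprices all overage at the highest tier reached is a different design and would need different logic.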

Treat 2026 renewals as the test. Bessemer flags a renewal cliff as 2025's AI pilots hit their first renewals with real usage data on both sides of the table. Walk in knowing your per-customer AI gross margin, or your customer will set the price.

Learn from Cursor: sequencing and communication are half the work. Grandfather generously, publish the meter, give customers dashboards before you give them bills.

Part II — What intelligence costs: the full-stack pricing analysis

Cloud APIs: deflation below, inflation at the top

The defining tension of mid-2026 pricing: the cost of constant-capability intelligence keeps collapsing, while frontier list prices are rising.

On the deflation side: a16z's "LLMflation" analysis found inference for constant-quality output declining ~10x per year; Epoch AI puts the median decline at ~50x per year depending on capability threshold; Stanford's AI Index documented a 280-fold fall in GPT-3.5-class query costs in 18 months. DeepSeek made its 75% V4-Pro price cut permanent in May 2026 — $0.435 per million input tokens, $0.87 output, with cache hits at fractions of a cent.

On the inflation side: GPT-5 launched in August 2025 at $1.25/$10 per million tokens (input/output); GPT-5.5 now lists at $5/$30 — roughly 4x flagship input-price inflation in ten months. Google's Gemini 3.5 Flash costs 3x the Gemini 3 Flash it replaced. Anthropic cut Opus pricing 3x in November 2025 ($15/$75 → $5/$25), then introduced a premium tier above it: Claude Fable 5 at $10/$50. Deflation now comes from older, smaller, and open-weight models — not from flagships.

Provider · Model | $/M input | $/M output | Notes
Anthropic · Claude Fable 5 | $10.00 | $50.00 | Premium tier; cache reads $1.00
OpenAI · GPT-5.5 | $5.00 | $30.00 | Long-context (>272K) $10/$45; cached $0.50
Anthropic · Claude Opus 4.8 | $5.00 | $25.00 | Fast mode 2x
Anthropic · Claude Sonnet 4.6 | $3.00 | $15.00 | Cache reads $0.30
OpenAI · GPT-5.4 | $2.50 | $15.00 | Mini $0.75/$4.50 · Nano $0.20/$1.25
Google · Gemini 3.1 Pro | $2.00 | $12.00 | >200K context: $4/$18
Google · Gemini 3.5 Flash | $1.50 | $9.00 | 3x the Gemini 3 Flash it replaced
Mistral · Medium 3.5 | $1.50 | $7.50 | Large 3: $0.50/$1.50
Anthropic · Claude Haiku 4.5 | $1.00 | $5.00 | Cache reads $0.10
DeepSeek · V4-Pro (first-party) | $0.435 | $0.87 | Cache hits $0.0036; 75% cut made permanent May 2026
Google · Gemini 3.1 Flash-Lite | $0.25 | $1.50 |
OpenAI · gpt-oss-120B (via Groq/Together/Fireworks) | $0.15 | $0.60 | Open-weight
DeepSeek · V4-Flash (first-party) | $0.14 | $0.28 |
Meta · Llama 4 Scout (via Groq/DeepInfra) | $0.10–0.11 | $0.30–0.34 | Open-weight
Alibaba · Qwen3-235B (via DeepInfra/Together) | $0.09–0.20 | $0.10–0.60 | Open-weight

Standard-tier list prices per 1M tokens, fetched from provider pricing pages June 10, 2026. Open-weight prices vary by host — DeepSeek V4-Pro costs 3–5x more via Western hosts (Together $2.10/$4.40; Fireworks $1.74/$3.48; DeepInfra $1.30/$2.60) than first-party. Always specify the host when quoting an "open-source model price."

Figure: What a million output tokens costs (log scale, $ per 1M output tokens). Claude Fable 5 $50; GPT-5.5 $30; Claude Opus 4.8 $25; GPT-5.4 and Sonnet 4.6 $15; Gemini 3.1 Pro $12; Claude Haiku 4.5 $5; DeepSeek V4-Pro $0.87; gpt-oss-120B $0.60; Llama 4 Scout $0.32; DeepSeek V4-Flash $0.28.
Output price per 1M tokens, standard tier, fetched June 10, 2026 (violet: proprietary frontier; magenta: open-weight / budget). The ~180x spread between the priciest frontier token and the cheapest capable open-weight token is the model-routing opportunity in one picture.
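The spread is easier to reason about per request than per million tokens. The sketch below prices a single call from the list rates in the table above; the request shape (8,000 input tokens, 1,000 output tokens) is a hypothetical workload, and caching, batch, and long-context surcharges are deliberately ignored.

```python
# (input, output) list prices in USD per 1M tokens, copied from the table above.
PRICES_PER_M = {
    "Claude Fable 5": (10.00, 50.00),
    "GPT-5.5": (5.00, 30.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "DeepSeek V4-Flash": (0.14, 0.28),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD of one request, with no caching or batch discount."""
    in_price, out_price = PRICES_PER_M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical request shape: 8,000 input tokens, 1,000 output tokens.
for name in PRICES_PER_M:
    print(f"{name}: ${request_cost(name, 8_000, 1_000):.4f}")
```

For this assumed request shape the most expensive model costs about $0.13 per call and the cheapest about $0.0014, so the routing decision is worth roughly two orders of magnitude per request before any quality trade-off is considered.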

Four pricing mechanics now matter as much as the headline rates, because they are where your COGS is actually won or lost:

On-prem vs cloud: own the baseline, rent the peaks

The build-vs-buy question has three layers: buy tokens (APIs), rent GPUs (cloud), or own GPUs (on-prem/colo). The economics in mid-2026:

Layer | Representative cost (June 2026) | Notes
H100 purchase | $25–40K/GPU; ~$250K per 8-GPU server | Used SXM5 trades at $12–22K; no official list price
H200 / B200 purchase | $30–40K / $30–50K per GPU | 8x B200 server ~$338K (Lenovo published)
GB200 NVL72 rack | ~$3M per 72-GPU rack | Vera Rubin racks quoted $5–8.8M; rack prices rising 2–3x per generation
H100 rental, neocloud | $2.46–3.99/GPU-hr on-demand | Lambda $3.99 · RunPod $3.29 · Nebius $3.85 ($2.15 preemptible) · spot floors ~$1.03–1.49
H100 rental, hyperscaler | $6.88–12.29/GPU-hr on-demand | 2–3x neocloud; AWS cut P5 prices 44% in June 2025 under neocloud pressure
B200 rental | $5.89–8.60/GPU-hr | RunPod $5.89 · Lambda $6.69 · CoreWeave $8.60
Power & cooling | 8x H100 ≈ $10.5K/yr electricity; GB200 rack ≈ $105K/yr | At $0.10–0.12/kWh; cooling adds 25–40%; liquid cooling cuts PUE ~1.5 → ~1.1
Ops staffing | ~0.5 FTE ≈ $225–300K per 3 years | Plus 10–20 eng-hours/month maintenance per serious deployment

GPU purchase figures are street estimates — NVIDIA publishes no list prices and ranges across sources span 30–50%. Rental rates fetched from live pricing pages June 10, 2026.

The decline in rental prices is real but no longer monotonic: H100 rates collapsed ~64–75% from their 2023 peaks ($8+/hr to $2–4/hr), but SemiAnalysis's one-year contract index rose ~40% between October 2025 and March 2026 ($1.70 → $2.35/hr) as inference demand absorbed surplus capacity. Plan on cyclical, not perpetually falling, GPU prices.

Figure: H100 cloud rental, $/GPU-hour, 2023–2026. $8+ (2023); ~$4.50 (2024); $1.70 (Oct 2025); $2.35 (Mar 2026); 2026 spot floors $1.03–1.49.
Indicative H100 rates: 2023–24 from Silicon Data / market trackers; Oct 2025 ($1.70) and Mar 2026 ($2.35, +40%) from SemiAnalysis's one-year contract index. Neocloud on-demand sat at $2.46–3.99/GPU-hr in June 2026; hyperscalers 2–3x higher. The decline is real — and no longer monotonic.

Self-hosting open-weight models: the math. Production-grade serving stacks are now very good — vLLM sustains ~2,200 output tokens/sec per H200 for DeepSeek-class MoE models in multi-node deployments; MLPerf shows a Llama-70B-class model producing ~30,500 tokens/sec on an 8x H100 node. Run the arithmetic and a fully utilized owned 8x H100 server produces output tokens at roughly $0.11 per million, and a rented H200 node serving DeepSeek lands near $0.57 per million at full utilization — against $15–50 per million output for frontier APIs. The raw spread is enormous. Three corrections shrink it:

  1. Utilization is everything. Halve utilization and your $/token doubles; most enterprise inference loads are bursty and diurnal. The rule of thumb: owning beats hyperscaler on-demand rental above roughly 20% sustained utilization (Lenovo's published math: ~4.3 hours/day over five years), but against neocloud or reserved pricing the threshold doubles or triples.
  2. Hidden costs run 3–5x the naive spreadsheet. Engineering labor, model upgrades every few months, evaluation, failover, and utilization waste. Budget them or your "cheap" cluster becomes an expensive hobby.
  3. Your competitor is the budget API, not the flagship. Against DeepSeek first-party at $0.435/$0.87 or gpt-oss-120B at $0.15/$0.60, self-hosting only wins at sustained volumes of 50–100M+ tokens/month — below ~500K tokens/day, pay-per-token is almost always cheaper.
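The arithmetic behind the headline figures and the utilization correction can be sketched as follows. The throughput numbers come from the text. The 3-year amortization of a $250K server, the 30% cooling uplift, and the $4.50/GPU-hr H200 rental rate are assumptions chosen for illustration, not figures from the sources.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def owned_cost_per_million(annual_cost_usd: float, tokens_per_sec: float,
                           utilization: float) -> float:
    """USD per 1M output tokens for owned hardware at a given utilization (0-1]."""
    tokens_per_year = tokens_per_sec * SECONDS_PER_YEAR * utilization
    return annual_cost_usd / tokens_per_year * 1_000_000


def rented_cost_per_million(gpu_hour_usd: float, tokens_per_sec_per_gpu: float,
                            utilization: float) -> float:
    """USD per 1M output tokens for a rented GPU billed by the hour."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600 * utilization
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Assumption: $250K server amortized over 3 years, plus $10.5K/yr electricity
# with a 30% cooling uplift. Labor and other hidden costs are excluded.
annual_owned = 250_000 / 3 + 10_500 * 1.30

owned_full = owned_cost_per_million(annual_owned, 30_500, 1.0)   # 8x H100 node
owned_half = owned_cost_per_million(annual_owned, 30_500, 0.5)
rented_full = rented_cost_per_million(4.50, 2_200, 1.0)          # per H200, assumed rate
```

With these assumed inputs the owned server comes out near $0.10 per million output tokens and the rented H200 near $0.57, in the neighborhood of the figures quoted above, and halving utilization exactly doubles both. Multiplying `annual_owned` by the 3–5x hidden-cost factor is a quick way to see how fast the gap to budget APIs closes.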

Figure: Output cost per 1M tokens, self-host vs API (log scale). Owned 8×H100, full utilization: $0.11; rented H200 node, full utilization: $0.57; DeepSeek V4-Pro API: $0.87; GPT-5.5 API: $30; Claude Fable 5 API: $50.
Self-host figures assume open-weight models (Llama-70B-class on owned 8×H100 per Lenovo/MLPerf; DeepSeek-class on a rented H200 node per vLLM benchmarks) at 100% utilization. Halve the utilization and the cost doubles, and real deployments carry 3–5x hidden costs. The honest comparison for most teams is the magenta bars, not the violet ones.

A decision framework that survives contact with finance

  1. Default to APIs until you have (a) sustained, predictable volume, (b) a quality-acceptable open-weight model for the task, and (c) the engineering bench to run it.
  2. Rent before you buy: prove utilization on neocloud reserved capacity first. It prices at $2–4/GPU-hr and converts your cost model to opex.
  3. Buy for the baseline, burst to cloud: own capacity sized to your trough load (where utilization stays high), and rent or call APIs for peaks.
  4. Buy on-prem outright only for data-sovereignty requirements or token volumes where the 3–5x hidden-cost multiple still leaves a clear win over budget APIs.
  5. Revisit annually: both API prices and GPU economics are moving too fast for a three-year-old decision to stay correct.
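The rent-versus-own threshold reduces to one ratio: the annual cost of owning divided by what the same GPUs would cost to rent around the clock. A minimal sketch follows, using rental rates from the table above and a hypothetical all-in ownership cost of $110K per year for an 8-GPU server (an assumption, to be replaced with your own amortization, power, and staffing numbers).

```python
HOURS_PER_YEAR = 8760


def breakeven_utilization(annual_ownership_cost: float, gpus: int,
                          rental_rate_per_gpu_hour: float) -> float:
    """Fraction of the year the GPUs must be busy for owning to match renting.

    A result above 1.0 means renting is cheaper even at full utilization.
    """
    full_year_rental = gpus * rental_rate_per_gpu_hour * HOURS_PER_YEAR
    return annual_ownership_cost / full_year_rental


ANNUAL_OWNERSHIP = 110_000  # hypothetical all-in cost for one 8x H100 server

vs_hyperscaler = breakeven_utilization(ANNUAL_OWNERSHIP, 8, 6.88)  # low end, hyperscaler
vs_neocloud = breakeven_utilization(ANNUAL_OWNERSHIP, 8, 2.46)     # low end, neocloud
```

Under that assumption, owning breaks even at about 23% utilization against hyperscaler on-demand pricing but needs about 64% against the cheapest neocloud rate. The roughly 2.8x gap between the two thresholds is simply the ratio of the rental rates, which is why the comparison baseline matters as much as the hardware price.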

The rest of the stack: cheap, and converging on the same pattern

Every layer around the model has settled into the same shape: open-source core, low flat platform fee, usage metering on top.

The structural takeaway: model inference is the only layer with real COGS gravity — every other layer is one to three orders of magnitude cheaper. But note the survey finding from Mavvrik's cost-governance research: "data platform usage" (56%) and "networking/egress" (52%) were cited as unexpected AI costs more often than LLM tokens (37%), and 84% of companies reported AI delivery costs cutting product gross margins by more than six points. The model bill is the biggest line; it is rarely the surprising one.

The unit of software value is no longer the person with a login. It is the work the machine did — and the strategic question is whether you can meter it to finance-grade.

Part III — Finance-grade usage data without ripping out billing

The market just told you metering matters

In sixteen months, the three biggest distribution owners in monetization each bought a metering engine. Stripe completed its acquisition of Metronome — which powers billing for OpenAI, Anthropic, and NVIDIA — in January 2026, at a reported ~$1B. Salesforce signed to acquire m3ter in June 2026 to bring metering and rating natively into Agentforce Revenue Management. Zuora bought Togai in 2024 and was itself taken private at $1.7B. Usage metering has gone from venture category to platform feature. The remaining independents — Orb (Vercel, Pinecone, Perplexity, Replit), open-source Lago (Mistral, Together), Amberflo — compete on flexibility; Stripe Billing's own usage product takes 0.7% of billing volume.

The strategic read: you will not have to build metering infrastructure from scratch, and you should not. What you do have to build is the discipline around it.

The pattern: a metering layer in front of what you already run

Nobody successful is doing rip-and-replace. The architecture that works is a mediation layer: meter where the work happens, rate and aggregate in a dedicated system, and feed rated results into the billing, CRM, and ERP systems you already own. Salesforce's own framing of the m3ter acquisition is exactly this — metering data flows across CRM, ERP, and quote-to-cash, not a new billing system. Confluent replaced a homegrown billing stack with a metering platform and eliminated weeks of manual work per pricing change; Ideogram stood up dynamic AI pricing in under a month with no internal billing team.

Pragmatically, finance-grade usage data means five properties, and they are requirements, not aspirations:

  1. One canonical event stream. Every monetizable action — tokens, agent runs, resolutions — emitted once, with customer ID, feature, model, and timestamp. Product analytics can sample; billing data cannot.
  2. Idempotent, immutable, replayable events, so a pipeline bug is a reprocessing job rather than a credit-memo apology tour.
  3. Rating separated from metering, with versioned, effective-dated price rules — so pricing can change quarterly without breaking the close or rewriting history.
  4. Cost joined to revenue at the same grain. The margin question — which customer, on which feature, at what model cost — must be answerable from one table. This is what the usage ↔ billing ↔ payments ↔ GL reconciliation layer exists to do.
  5. Entitlements as a service, not as if-statements. Plan limits, feature gates, and token budgets belong in a dedicated layer (Stigg, Schematic, or built equivalents) that product and commercial teams can change without an engineering release. This is also your kill switch when a customer's agent goes runaway.
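Properties 1 to 3 can be illustrated in a few lines. This is a minimal in-memory sketch, not a production design: the class and field names are hypothetical, and a real system would back the store with a durable log and the price rules with a managed catalog.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, List


@dataclass(frozen=True)  # frozen: an event cannot be mutated after emission
class UsageEvent:
    event_id: str        # idempotency key supplied by the emitter
    customer_id: str
    feature: str
    model: str
    quantity: float      # tokens, agent runs, resolutions, ...
    timestamp: datetime


@dataclass(frozen=True)
class PriceRule:
    feature: str
    effective_from: date  # rule applies to events on or after this date
    unit_price: float


class Meter:
    """Append-only store that ignores re-deliveries of the same event_id."""

    def __init__(self) -> None:
        self._events: Dict[str, UsageEvent] = {}

    def ingest(self, event: UsageEvent) -> bool:
        if event.event_id in self._events:
            return False  # duplicate delivery: safe to drop
        self._events[event.event_id] = event
        return True

    def replay(self) -> List[UsageEvent]:
        return list(self._events.values())


def rate(event: UsageEvent, rules: List[PriceRule]) -> float:
    """Price an event with the latest rule in effect at the event's timestamp."""
    applicable = [r for r in rules
                  if r.feature == event.feature
                  and r.effective_from <= event.timestamp.date()]
    if not applicable:
        raise LookupError(f"no price rule in effect for {event.feature}")
    return event.quantity * max(applicable, key=lambda r: r.effective_from).unit_price
```

Because rating is a pure function of stored events and dated rules, a pricing change adds a new `PriceRule` rather than editing history, and a pipeline bug is fixed by replaying the same events through corrected code.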

Revenue recognition: credits are a liability until consumed

AI pricing has made prepaid credits the dominant enterprise structure, and credits have real ASC 606 mechanics. The practitioner consensus (Leapfin, Ordway, and peers — notably, the Big 4 have not yet published AI-token-specific guidance, so vendor-led practice is the current state of the art):
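The heading's claim, that prepaid credits sit as a liability until they are consumed, can be pictured with a toy ledger. This is illustrative bookkeeping only: it assumes a single price per credit, ignores breakage and expiry entirely, and uses hypothetical names; the actual policy is something to agree with auditors.

```python
from dataclasses import dataclass


@dataclass
class CreditLedger:
    price_per_credit: float          # assumed constant across purchases
    credits_remaining: float = 0.0
    deferred_revenue: float = 0.0    # contract liability for unused credits
    recognized_revenue: float = 0.0

    def purchase(self, credits: float) -> None:
        """Cash comes in, but nothing is earned yet: the full amount is deferred."""
        self.credits_remaining += credits
        self.deferred_revenue += credits * self.price_per_credit

    def consume(self, credits: float) -> None:
        """Usage moves the matching amount from liability to revenue."""
        if credits > self.credits_remaining:
            raise ValueError("consumption exceeds remaining credits")
        amount = credits * self.price_per_credit
        self.credits_remaining -= credits
        self.deferred_revenue -= amount
        self.recognized_revenue += amount
```

The practical consequence is that revenue follows the meter rather than the invoice, which is why the usage event stream has to be reliable enough to close the books on.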

Margin protection: an engineering discipline with a finance dashboard

The cost side has a now-standard toolkit, and the savings are large enough to be strategy rather than housekeeping:

Figure: Maximum cost reduction by lever. Prompt caching: up to 90%; model routing: 35–85%; batch processing: 50%.
Caching: cached input priced at ~10% of list at the frontier labs (applies to re-sent prompt prefixes, not all tokens). Routing: RouteLLM benchmarks, 35–85% depending on task mix, retaining 95% of frontier quality. Batch: the universal 50% discount for non-latency-sensitive work.
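These are maximums, and a short calculation shows how far a realistic workload sits from them. The sketch below uses the ~10% cached-input ratio and 50% batch discount described above; the cache hit fraction and routing share are hypothetical inputs, and the function names are illustrative.

```python
def effective_input_price(list_price: float, cache_hit_fraction: float,
                          cached_price_ratio: float = 0.10) -> float:
    """Blended input price when a fraction of input tokens is served from cache."""
    cached = cache_hit_fraction * cached_price_ratio
    uncached = 1.0 - cache_hit_fraction
    return list_price * (cached + uncached)


def routed_price(frontier_price: float, budget_price: float,
                 share_to_budget: float) -> float:
    """Blended price when a share of traffic is routed to a cheaper model."""
    return share_to_budget * budget_price + (1.0 - share_to_budget) * frontier_price


def batch_price(list_price: float, batch_discount: float = 0.50) -> float:
    """Price for work that can tolerate batch latency."""
    return list_price * (1.0 - batch_discount)


# Hypothetical workload: $5.00/M list input with 80% of input tokens cached,
# and 70% of output traffic routed from a $30/M model to a $0.60/M model.
blended_input = effective_input_price(5.00, 0.80)
blended_output = routed_price(30.00, 0.60, 0.70)
```

With an 80% cache hit fraction the input price falls to about $1.40 per million, a 72% saving rather than 90%, because the uncached remainder still pays list. Routing 70% of output traffic in this example brings the blended output price to about $9.42, a saving of roughly 69%, which sits inside the 35–85% range cited for routing.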

The 90-day version

Days 1–30: Instrument the canonical usage event stream and stand up cost observability (gateway + tracing). Produce the first per-customer, per-feature AI margin report — however ugly.

Days 31–60: Pick the metering/rating layer (buy, not build) and run it shadow-mode against current invoices. Move entitlements out of code. Agree the ASC 606 credit policy with your auditors before the next pricing change ships.

Days 61–90: Turn on margin protection — routing, caching, budgets, alerts. Launch the repriced offer to a cohort, grandfathered and dashboard-first. Review the margin report with product, engineering, and finance in the same room, monthly, forever.

Part IV — The operating model: Product, Tech & Finance as one pricing team

The deeper change AI forces is organizational. Pricing used to be an annual marketing exercise. It is now a continuous, cross-functional control loop — the top 500 software companies with public pricing made 1,800+ pricing changes in 2025, roughly 3.6 each. A loop that fast cannot run through committees. It runs through clear ownership:

01 Product owns the unit of value.

Product's job is to define monetizable use cases: which AI work maps to which billable unit — a resolution, an action, a task, a token bundle, a seat that happens to include AI. The test for each candidate unit: the customer can predict it, finance can recognize it, engineering can meter it, and a renewal conversation can defend it. Hybrid is the default answer (Bessemer calls it the effective middle ground); outcome units are earned one use case at a time, starting where attribution is clean. Product also owns the packaging ladder — which models, what context, how much priority each tier buys — because in AI products, the model menu is the packaging.

02 Technology owns the meter and the margin.

Engineering's monetization deliverables are: the canonical event stream; entitlements enforcement at runtime; the gateway layer (routing, caching, failover, budgets); and cost attribution at the same grain as revenue. The infrastructure decisions — API vs open-weight, rent vs own, which providers — are now pricing decisions, made jointly with finance against the break-even math in Part II. The engineering culture shift: tokens are COGS, cache hit rate is a margin metric, and shipping an AI feature without a meter is shipping an unpriced liability.
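"Cost attribution at the same grain as revenue" means both sides can be keyed by the same tuple. A minimal sketch follows, assuming rated revenue and model cost each arrive as `(customer, feature, amount)` rows; the row shape and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, str, float]  # (customer_id, feature, amount in USD)


def margin_report(rated_revenue: Iterable[Row],
                  model_costs: Iterable[Row]) -> Dict[Tuple[str, str], dict]:
    """Join revenue and cost on (customer, feature) and compute gross margin."""
    revenue: Dict[Tuple[str, str], float] = defaultdict(float)
    cost: Dict[Tuple[str, str], float] = defaultdict(float)
    for customer, feature, amount in rated_revenue:
        revenue[(customer, feature)] += amount
    for customer, feature, amount in model_costs:
        cost[(customer, feature)] += amount

    report = {}
    for key in set(revenue) | set(cost):
        rev, cst = revenue[key], cost[key]
        # Margin is undefined where cost was incurred with no revenue: flag it as None.
        margin: Optional[float] = (rev - cst) / rev if rev else None
        report[key] = {"revenue": rev, "cost": cst, "gross_margin": margin}
    return report
```

Rows with cost but no revenue are kept and surfaced with a `None` margin rather than dropped, since unmetered AI work is exactly the unpriced liability the report is meant to expose.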

03 Finance owns truth and the guardrails.

Finance turns the meter into money: revenue-grade reconciliation from usage to GL, the credit/breakage policy agreed with auditors, forecasting models for consumption revenue (harder than seats — variable consideration, commit-and-drawdown dynamics, usage seasonality), and the per-customer margin dashboard that triggers repricing conversations. Finance also sets the guardrails product and engineering operate inside: minimum gross margin per offer, maximum unhedged exposure to a single model provider, spend thresholds where automated protection kicks in.
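The three guardrails named above can be expressed as a check that runs against the same data as the margin dashboard. This is a sketch under stated assumptions: the thresholds, field names, and violation labels are hypothetical, and what happens on a violation (alert, throttle, reprice) is a policy choice left out here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Guardrails:
    min_gross_margin: float     # floor per offer, e.g. 0.50 for 50%
    max_provider_share: float   # cap on spend share with any single model provider
    customer_spend_cap: float   # USD per period before automated protection kicks in


def violations(g: Guardrails, gross_margin: float,
               spend_by_provider: Dict[str, float],
               customer_spend: float) -> List[str]:
    """Return the list of guardrails currently breached (empty if none)."""
    found: List[str] = []
    if gross_margin < g.min_gross_margin:
        found.append("margin_below_floor")
    total = sum(spend_by_provider.values())
    if total > 0:
        for provider, spend in sorted(spend_by_provider.items()):
            if spend / total > g.max_provider_share:
                found.append(f"provider_concentration:{provider}")
    if customer_spend > g.customer_spend_cap:
        found.append("customer_spend_cap_exceeded")
    return found
```

Keeping the thresholds in one declared object, rather than scattered through code, is what lets finance change them without an engineering release.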

The cadence that binds the three: a monthly monetization review — usage trends, margin by customer and feature, cache and routing performance, pricing experiment results — and a quarterly pricing release, treated with the same discipline as a product release: versioned, tested, communicated, reversible.

What we'll grant the other side

We believe what we just wrote. We will also push back on ourselves.

The seat may outlive its obituary. Bain found no established vendor that fully abandoned seat pricing, and 35% succeeded by simply raising seat prices with AI bundled in. If your AI feature makes each human meaningfully more productive and your costs per user are bounded, a higher seat price is simpler, more predictable, and easier to sell than any meter. Complexity is a real price customers pay; do not spend it without need.

The cost crunch may ease. Constant-capability token prices fall 10–50x per year, and for many products, "good enough" intelligence keeps getting cheaper faster than usage grows. Margins are already recovering — 41% to 52% in two years on ICONIQ's data. The bet against this: agentic workloads are growing token consumption per task faster than unit prices fall at the frontier, and competitive pressure pushes everyone toward the newest, priciest models. We think the crunch persists for frontier-dependent products and eases for everyone else — which is itself an argument for the routing-and-hybrid strategy in this paper.

Outcome pricing can misalign as easily as align. A vendor paid per resolution is incentivized to mark conversations resolved; attribution disputes consume the trust the model was meant to build; and customers may rationally prefer a predictable bill over a fair one. The 72-hour confirmation windows and no-charge escalation rules are honest engineering around real failure modes — they are also evidence of how much scaffolding outcome pricing needs.

And metering can hurt. Cursor's backlash and GitHub Copilot users' 10–50x bills show the meter transferring risk to the customer. The companies that win with usage pricing spend as much on predictability — caps, alerts, included allowances, dashboards — as on the meter itself.

Why this matters now

Every actor in the system is repricing at once: the labs are raising flagship prices while open-weight models race to the floor; the platforms just spent billions buying metering engines; your customers' procurement teams are walking into 2026 renewals with usage dashboards of their own. The companies that thrive will be those that can see their cost and value at the same grain — and change price as routinely as they change code. The window in which "we're still figuring out AI pricing" is an acceptable answer is closing.

What we're committing to

The AI Impact Foundation will keep this analysis current — pricing tables re-verified against primary sources, the on-prem break-even math updated as GPU and API prices move — and will teach monetization in our courses the way this paper argues it should be practiced: as a cross-functional discipline with finance-grade data underneath it, not a slide in a board deck.

That's the work. Come help us do it.


Sources

Notes on confidence: GPU purchase prices are street estimates (no official list prices exist); OpenAI/Anthropic gross-margin figures are outside estimates of private companies from paywalled reporting and are presented as ranges; the Stripe–Metronome price (~$1B) is reported, not company-confirmed. All API and stack prices were fetched from primary vendor pages on June 10, 2026 and will drift — quickly.

If you're pricing AI — or being priced by it — let's talk.

We work with product, technology, and finance leaders building monetization that holds up in the margin column and the renewal meeting. If that's the problem on your desk, we'd like to hear from you.
