The era of unlimited AI subscriptions is ending. Agentic workflows that run autonomously for hours burn through far more tokens than human users ever did, making flat-rate pricing unsustainable for providers. As a result, billing is shifting from monthly plans to usage-based models tied to token consumption.
Why providers are moving away from flat rates
On June 1, 2026, GitHub Copilot began rolling out a usage-based model called "GitHub AI Credits." These credits are tied to actual token usage and the API prices of each model. They apply when Copilot does more than suggest code, mainly in chat, CLI, and agent features. Standard completions remain free under paid plans.
GitHub explained that a short chat question used to be treated the same as an autonomous coding session that runs for hours. That model could not last.
Anthropic is also drawing a sharper line between normal use and agentic workflows. Products like Claude Code, Claude Cowork, and Managed Agents turn its AI into a digital worker. Anthropic blamed bottlenecks in Claude Code on peak loads and contexts of up to one million tokens. Older plans fit heavy chat use but not always-on agent workflows.
Nearly half of all agentic tool calls on Anthropic's public API go to software development, according to the company's own analysis. Customer service, sales, finance, and e-commerce each sit at a few percent. Simple chat requests still dominate in those fields, but agentic use will likely spread as workflows mature there.
Why the token price alone is misleading
A flat price comparison between models can be deceptive. GPT-5.5 costs $30 per million output tokens while DeepSeek V4 Pro costs 87 cents. But the real cost depends on consumption per task. A cheap model can get expensive if it needs more tries, fails often, or requires cleanup. A pricier model pays off if it reaches the goal with fewer loops and less human oversight.
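The point can be made concrete with a small calculation. The following is a minimal sketch, not a benchmark: only the two output prices come from the figures above, while the token counts, success rates, and rework costs are assumptions chosen for illustration. It also ignores input tokens for simplicity.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Per-model figures. Only the output prices are taken from the article;
    every other value is an assumption to be measured per use case."""
    name: str
    output_price_per_million: float  # USD per million output tokens
    output_tokens_per_attempt: int   # assumed
    success_rate: float              # assumed share of attempts that succeed
    rework_cost_per_success: float   # assumed human cleanup cost in USD


def cost_per_successful_task(m: ModelProfile) -> float:
    """Expected cost of one usable result when failed attempts are retried."""
    if not 0 < m.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    attempt_cost = m.output_tokens_per_attempt / 1_000_000 * m.output_price_per_million
    expected_attempts = 1 / m.success_rate  # mean of a geometric distribution
    return attempt_cost * expected_attempts + m.rework_cost_per_success


# Illustrative, assumed profiles
pricey = ModelProfile("GPT-5.5", 30.0, 200_000, 0.9, 5.0)
cheap = ModelProfile("DeepSeek V4 Pro", 0.87, 400_000, 0.5, 20.0)

for m in (pricey, cheap):
    print(f"{m.name}: ${cost_per_successful_task(m):.2f} per usable result")
```

With these assumed figures, the pricier model comes to about $11.67 per usable result and the cheaper one to about $20.70, because human cleanup dominates the bill. Different assumptions flip the result, which is exactly why the per-token price alone says little.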
OpenRouter's analysis of real-world usage showed that GPT-5.5, despite being designed to give shorter answers, led to cost increases of 49 to 92 percent over its predecessor, depending on input length.
The token price and the number of tokens consumed can rise at the same time. Google's Gemini 3.5 Flash costs three times as much per token as Gemini 3 Flash. In Artificial Analysis's evaluation, the model also needed more steps in the Intelligence Index run, making it more expensive than Google's current flagship Gemini 3.1 Pro in that test.
DeepSeek pushes prices down with rock-bottom rates. Its bet is that if you pay only a fraction per token, you can run the same job four or five times and still come out cheaper. But if the final result does not hold up, rework quickly eats the price advantage.
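How thin that cushion is can be sketched with the two output prices quoted earlier. The function below is a hypothetical helper, and it assumes both models use the same number of output tokens per run, which real workloads may not.

```python
def rework_headroom_usd(pricey_price: float, cheap_price: float,
                        tokens_per_run: int, cheap_runs: int) -> float:
    """USD of rework the cheap route can absorb before it costs more
    than a single run on the pricey model.

    Prices are USD per million output tokens. Equal tokens per run for
    both models is an assumption made for illustration.
    """
    pricey_cost = tokens_per_run / 1_000_000 * pricey_price
    cheap_cost = cheap_runs * tokens_per_run / 1_000_000 * cheap_price
    return pricey_cost - cheap_cost


# Five cheap runs of one million output tokens each versus one pricey run
headroom = rework_headroom_usd(30.0, 0.87, 1_000_000, 5)
print(f"${headroom:.2f}")
```

Under these assumptions, five cheap runs leave $25.65 of headroom per million output tokens. A few minutes of an engineer's time spent fixing the result is enough to use that up.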
How the token market is splitting by performance class
The market is no longer about a single token price. A fast token in a coding agent, a cheap token in a mass-market app, and a specialized token in security analysis are different economic products. Providers are selling different inference services, not just compute time. The scarcer, faster, or more valuable that service is, the further the price can drift from raw costs.
Nvidia CEO Jensen Huang explained this in two recent interviews. On Dwarkesh Patel's show, he said Nvidia recently licensed the inference architecture of startup Groq and folded it into its own CUDA ecosystem. The reason is economic: the value of a token has risen so much that different prices for different token types now make sense.
Huang noted that tokens used to cost next to nothing. Now different customers want different kinds of answers. If software engineers get much more responsive tokens and become more productive as a result, companies will pay for it. Premium inference with lower latency pays off because tokens at the top of the market can command higher prices.
Where a token's value depends on the outcome it makes possible, even more segmentation is possible. According to The Information, Palo Alto Networks tested Anthropic's security model Mythos to scan its own source code for vulnerabilities. The model reportedly found more than two dozen critical vulnerabilities in about three weeks, roughly five times as many as existing methods. The test racked up token costs in the millions, but those costs can be rational if the security holes found would cost many times more if exploited.
British biotech company Basecamp Research wants to scale its biological AI dataset from 10 billion to one trillion genes and other data points with its "Trillion Gene Atlas" project. The dataset is proprietary. If such models deliver solid intermediate products like drug candidates, the token run can't be compared to a chat or coding reply.
Huang told Lex Fridman that computers used to be warehouses for data but today they are factories for tokens. Like every factory, this one produces several products at the same time. He sees a market with clearly tiered segments taking shape, where someone willing to pay $1,000 per million tokens is "just around the corner."
The productivity gap and the temptation of tokenmaxxing
Agentic AI is billed by usage, and token prices are splitting by performance class. The cost side becomes more precise, higher, and more visible. That sharpens the question of whether AI actually saves time and pays off. But costs can be measured ever more exactly while benefits often stay vague.
Uber shows how hard attribution gets inside a single company. According to Fortune, Uber burned through its planned 2026 AI coding tools budget in just four months. Uber COO Andrew Macdonald questioned whether rising use of Claude Code clearly translates into more useful consumer features. Token costs are known down to the cent, but whether they turn into products that users need and that show up positively on the bottom line is an open question.
SemiAnalysis calls this "Dark Output." AI may be doing economically valuable work that barely shows up in traditional statistics. When tasks once paid for as consulting hours or legal services move into internal AI workflows, the token costs stay measurable but the value no longer appears as its own transaction in GDP.
Out of this measurement gap comes a pragmatic stopgap: tokenmaxxing. This is the assumption that more AI use automatically brings more benefit. The only reliable measure of "more AI" is token usage, but that measures activity, not outcome. An agent that spends two hours solving a task wrong burns more tokens than one that solves it correctly in five minutes. In tokenmaxxing logic, the first would look more productive.
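The difference between an activity metric and an outcome metric fits in a few lines. This is a minimal sketch; the token counts for the two runs are assumed, and `AgentRun` is a hypothetical record, not a real logging format.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    label: str
    tokens_used: int  # assumed figure for illustration
    solved: bool


def rank_by_activity(runs: list[AgentRun]) -> list[AgentRun]:
    """Tokenmaxxing view: more tokens looks like more productivity."""
    return sorted(runs, key=lambda r: r.tokens_used, reverse=True)


def tokens_per_solved_task(runs: list[AgentRun]) -> float:
    """Outcome view: tokens spent per task that was actually solved."""
    solved = sum(r.solved for r in runs)
    total = sum(r.tokens_used for r in runs)
    return total / solved if solved else float("inf")


slow_wrong = AgentRun("two hours, wrong result", 3_000_000, False)
fast_right = AgentRun("five minutes, correct result", 150_000, True)

print(rank_by_activity([fast_right, slow_wrong])[0].label)
print(tokens_per_solved_task([slow_wrong]), tokens_per_solved_task([fast_right]))
```

Ranked by activity, the failed two-hour run comes out on top. Measured per solved task, its cost is unbounded, while the short run costs 150,000 tokens per result.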
Agentic AI makes the problem worse in two ways. Consumption rises massively, and the immediate human quality check falls away. In chat, the user sees the answer right away. An agent runs autonomously for minutes or hours and delivers a result at the end that may need to be fixed or thrown out. Until then, token usage is the only signal.
Why agentic AI needs clear task framing
If token usage alone is not a reliable steering metric, control has to start with the task itself. A failed attempt in an agent is much more expensive than a bad prompt in chat. If a run breaks off after two hours with no result, the tokens are still gone.
Agentic AI needs clear task framing: what should be solved, which data and tools are allowed, when does a human review, when does the agent abort, and what can the attempt cost? Every company knows this logic from working with freelancers. An editor does not tell a freelance writer to just write no matter how long it takes.
An example: "Review this pull request with the standard model. If you spot security-relevant changes, escalate only the relevant files and hunks to the more expensive review model. Before each call, abort if the input context exceeds 200,000 tokens. Track cumulative input and output tokens, and stop if the review exceeds the token budget."
Setting limits like that is hard because consumption is tough to estimate in advance. Values must be built up empirically per use case. The example also contains the practical answer to token segmentation: using a cheap standard model for routine work and only escalating to a pricey specialist model when needed. Early Mythos testers already report this kind of routing approach.
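The framing in the pull request example could be enforced in code roughly as follows. This is a sketch under stated assumptions: the `ModelCall` signature, the `count_tokens` function, and the `is_security_relevant` predicate are all hypothetical stand-ins, since real SDKs report usage in their own response formats and the escalation decision would in practice come from the standard model's output.

```python
from dataclasses import dataclass
from typing import Callable

MAX_CONTEXT_TOKENS = 200_000  # per-call input limit from the example prompt

# Hypothetical signature: text in, (findings, input_tokens, output_tokens) out.
ModelCall = Callable[[str], tuple[list[str], int, int]]


class ReviewAborted(Exception):
    """Raised when a limit from the task framing is hit."""


@dataclass
class TokenBudget:
    limit: int
    used: int = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        # Track cumulative input and output tokens and stop once over budget.
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise ReviewAborted(f"token budget exceeded: {self.used} > {self.limit}")


def guarded_call(model: ModelCall, text: str, budget: TokenBudget,
                 count_tokens: Callable[[str], int]) -> list[str]:
    # Abort before the call if the input context is too large.
    if count_tokens(text) > MAX_CONTEXT_TOKENS:
        raise ReviewAborted("input context exceeds 200,000 tokens")
    findings, input_tokens, output_tokens = model(text)
    budget.charge(input_tokens, output_tokens)
    return findings


def review_pull_request(hunks: dict[str, str], standard: ModelCall,
                        expensive: ModelCall,
                        is_security_relevant: Callable[[str], bool],
                        budget: TokenBudget,
                        count_tokens: Callable[[str], int]) -> list[str]:
    # Routine pass over everything with the standard model.
    findings = guarded_call(standard, "\n".join(hunks.values()), budget, count_tokens)
    # Escalate only the security-relevant hunks to the expensive model.
    flagged = [h for h in hunks.values() if is_security_relevant(h)]
    if flagged:
        findings += guarded_call(expensive, "\n".join(flagged), budget, count_tokens)
    return findings
```

Note that the budget check runs after each call, so the tokens of the call that crosses the limit are already spent. A stricter variant would estimate the cost up front and refuse the call, at the price of needing a reliable estimate.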
Four symptom patterns in operations
Once task framing and routing are in place, one question remains: how do you tell during operations whether a workflow is actually working? Four patterns can be distinguished.
High usage with a usable result is the most unremarkable case. The task gets done but more expensively than necessary. The causes usually lie in routing: a frontier model for a task a smaller one could have handled, or missing caching.
High usage with a bad result is the biggest risk of the agentic era. Money is burned without anything usable at the end. The cause is rarely in one spot; unclear task framing, the wrong model class, and missing abort rules usually overlap.
Low usage with high rework means the token bill is low because the model answers fast, but every output has to be reworked by humans at length. The costs shift from the token bill to payroll. This pattern is deceptive because the token bill looks like a success.
Usage without attributable value means token costs show up on the balance sheet but nobody can say which process contributed what. Work that used to be done differently or externally moves into internal token costs and vanishes from value attribution.
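As a rough monitoring aid, the four patterns can be expressed as a simple classifier. This is a sketch only: the input fields and both thresholds are assumptions that each team would have to define and calibrate per use case.

```python
from enum import Enum


class Pattern(Enum):
    EXPENSIVE_BUT_USABLE = "high usage, usable result"
    EXPENSIVE_AND_BAD = "high usage, bad result"
    CHEAP_BUT_REWORK = "low usage, high rework"
    UNATTRIBUTED = "usage without attributable value"
    NO_MATCH = "none of the four patterns"


def classify(tokens_used: int, usable: bool, rework_hours: float,
             attributed: bool, *, token_threshold: int,
             rework_threshold: float) -> Pattern:
    """Map one workflow run to a symptom pattern.

    Thresholds are hypothetical and need empirical calibration.
    """
    # Without attribution, the other signals cannot be tied to a process.
    if not attributed:
        return Pattern.UNATTRIBUTED
    if tokens_used > token_threshold:
        return Pattern.EXPENSIVE_BUT_USABLE if usable else Pattern.EXPENSIVE_AND_BAD
    if rework_hours > rework_threshold:
        return Pattern.CHEAP_BUT_REWORK
    return Pattern.NO_MATCH


print(classify(5_000_000, False, 0.0, True,
               token_threshold=1_000_000, rework_threshold=2.0).value)
```

The order of the checks is a design choice: attribution is tested first because a run that cannot be tied to a process cannot be judged on usage or rework either.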
Where the token economy could go
The future depends on how fast companies learn to steer AI work. Three scenarios follow.
In the baseline scenario, big providers roll out the hybrid model of base subscription plus usage-based credits across the board. Companies gradually build FinOps structures for AI and experiment with model routing. Premium segments emerge in tightly bounded fields without flipping the broad market.
In the acceleration scenario, agent models and tool integration improve faster than expected, and autonomous workflows spread quickly beyond software development. Token market segmentation speeds up, and differentiated prices eventually lead to outcome-based pricing like pay per pull request or pay per vulnerability.
In the slowdown scenario, cases like Uber pile up where AI budgets explode without clear benefit. CFOs set harder limits and delay rollouts. Providers come under pressure to guarantee quality or cut prices.
The most likely scenario is the baseline. The shift to usage-based models is already decided or underway at the big providers. Cases like Uber and cost jumps with GPT-5.5 or Gemini 3.5 Flash show that companies still have to build steering competence. A real slowdown is unlikely because of investment pressure and early evidence of benefits in software development.
Our take
In the agent era, the token becomes a business metric, comparable to the fuel consumption of a trucking company. To run economically, you have to know how many liters each trip burns, which trip needs which fuel, and which trip is even worth taking. The companies that master this economy are the ones that can answer one question: which work are we buying with which tokens, and how do we know it was worth it?

