24 Apr 2026
[Image: A futuristic AI workspace shows coding tools, browser panels, research screens, and workflow dashboards merging into one unified digital system.]
Nvidia and Google are both making a serious push to lower the cost of AI inference, and that could be one of the biggest shifts in the market this year. The real story is not only faster chips, but a broader systems race to make agents, enterprise AI, and physical AI affordable enough to run at scale.
For the past few years, most of the public conversation around artificial intelligence has been built around size, spectacle, and model launches. Bigger models, bigger data centres, bigger claims. But that phase was never going to last forever. At some point the market had to move from showing what AI can do to asking what it costs to keep doing it every minute of every day. That is where this latest Nvidia and Google push matters. At Google Cloud Next this week, Google rolled out a broader AI Hypercomputer expansion, including new TPU systems, new storage and networking upgrades, and a new A5X bare-metal offering built around NVIDIA Vera Rubin NVL72. Nvidia framed the same moment as a deeper joint infrastructure push for agentic and physical AI, with the headline promise that A5X is designed to deliver up to 10 times lower inference cost per token and 10 times higher token throughput per megawatt than the prior generation.
The important thing here is not just that two giant companies launched more hardware. The important thing is that both are now talking openly about the same pressure point. Inference is the stage where trained models actually do work for users. It is the part that answers queries, runs copilots, powers agents, processes business tasks, and keeps consumer apps alive. Google’s own positioning at Next made this plain. It described the industry as moving into an agentic era where one request can trigger many coordinated tasks, which creates far more complexity and can push costs sharply higher if infrastructure is not designed for it. That is why Google is not presenting these systems as luxury upgrades. It is presenting them as the base layer for making agentic services fast enough and cheap enough to run in the real world.
Training still matters, and it still draws huge investment, but inference is where scale becomes expensive in a much more relentless way. A model might be trained once or updated periodically, but inference happens every time a customer asks a question, every time an assistant summarises a document, every time an enterprise agent checks a workflow, and every time a business tries to automate something that used to require a person. What this really means is that even modest efficiency gains at the inference level can turn into enormous savings across millions or billions of requests. Google's TPU 8i was introduced specifically as an inference and reinforcement learning system, with Google saying it delivers 80 percent better performance per dollar for inference than the prior generation. Nvidia, meanwhile, tied its Vera Rubin-based A5X roadmap directly to lower cost per token and better throughput per megawatt, which tells you exactly where enterprise demand is heading.
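To see why that compounding matters, it helps to put rough numbers on it. The sketch below is purely illustrative: the request volumes and baseline price are assumptions, not figures from either company. It simply models Google's claimed 80 percent better performance per dollar as the same work costing 1/1.8 of the baseline.

```python
# Back-of-envelope sketch: small per-token savings compound at scale.
# Every input here is an illustrative assumption, not a vendor figure.

requests_per_day = 500_000_000       # assumed daily inference requests
tokens_per_request = 1_000           # assumed average tokens per request
baseline_cost_per_m_tokens = 1.00    # assumed baseline price, $ per million tokens

# Model "80 percent better performance per dollar" as the same
# workload costing 1 / 1.8 of the baseline.
improved_cost_per_m_tokens = baseline_cost_per_m_tokens / 1.8

daily_m_tokens = requests_per_day * tokens_per_request / 1_000_000
baseline_daily = daily_m_tokens * baseline_cost_per_m_tokens
improved_daily = daily_m_tokens * improved_cost_per_m_tokens

print(f"Baseline daily spend: ${baseline_daily:,.0f}")
print(f"Improved daily spend: ${improved_daily:,.0f}")
print(f"Annualised saving:    ${(baseline_daily - improved_daily) * 365:,.0f}")
```

Even at these modest assumed prices, the gap runs to tens of millions of dollars a year, which is why per-token efficiency has become the headline metric.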
One of the more telling parts of the Google announcement is that it is now separating the training story from the inference story more clearly. TPU 8t is the training machine. TPU 8i is the inference machine. That may sound technical, but the business meaning is simple. Google is recognising that the AI economy is maturing. Different workloads need different infrastructure, and companies do not want to overpay by using blunt instruments for every task. Google says TPU 8t delivers nearly three times the compute performance of the previous generation for training and can pack 9,600 chips into a single superpod, while TPU 8i is engineered for low-latency inference, larger on-chip memory, and lower lag during high-concurrency requests. This is not just a hardware upgrade. It is a sign that the market is moving from general-purpose excitement to more specialised cost control.
If Google is sharpening the distinction between training and inference, Nvidia is leaning into density and industrial scale. Nvidia says the new A5X bare-metal instances will use Vera Rubin NVL72 rack-scale systems and next-generation Google Virgo networking, supporting up to 80,000 Rubin GPUs in a single-site cluster and up to 960,000 across multiple sites. That is a staggering number, but the point is not just size for its own sake. The point is that large language models and multi-agent systems get more expensive and less responsive when networking, memory, and data flow become bottlenecks. Nvidia and Google are both trying to solve that by treating the full stack as one problem rather than separate parts. Chips, networking, storage, orchestration, and software all have to work together if inference is going to be cheap enough to spread more widely.
A lot of headlines in AI infrastructure fixate on the chip name because it is easy to understand and easy to market. But the deeper story here may be the network fabric behind it. Google’s Virgo Network is described as a breakthrough data centre fabric with four times the bandwidth of previous generations. Google says it can connect 134,000 TPUs in a single data centre fabric and more than one million TPUs across multiple sites, while also being made available for A5X deployments with support for up to 80,000 GPUs in one data centre and 960,000 across multiple sites. What this really means is that scale is no longer only about how many accelerators a company can buy. It is about whether those accelerators can actually work together efficiently without introducing waste, lag, and idle time. Inference costs do not fall just because a chip is faster. They fall when the whole machine around the chip stops getting in the way.
The same pattern shows up in storage. Google announced that Managed Lustre now delivers 10 terabytes per second of bandwidth, which it says is a 10 times improvement over last year and up to 20 times faster than other hyperscalers. It also pointed to sub-millisecond latency in Rapid Buckets and said these changes help keep accelerators at 95 percent utilisation or higher during training checkpoints and recoveries. On paper that sounds like an infrastructure detail. In practice it is part of the same cost story. Idle accelerators are expensive accelerators. Every delay in storage, every stall in checkpointing, every badly timed recovery event pushes useful work down and effective cost up. This is where things change. The competition is no longer just who has the most powerful AI chips. It is who can keep the whole system fed, coordinated, and productive enough to reduce the real cost of running intelligence.
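The utilisation point is easy to quantify. As a rough sketch with an assumed hourly accelerator cost, the effective price of useful work is simply the hourly rate divided by the fraction of time the chip is actually busy:

```python
# Illustrative sketch: effective cost per useful hour vs utilisation.
# The hourly rate is an assumption for illustration, not a quoted price.

hourly_cost = 40.0  # assumed all-in cost of one accelerator-hour, in dollars

for utilisation in (0.95, 0.80, 0.60):
    effective = hourly_cost / utilisation  # dollars per hour of useful work
    print(f"{utilisation:.0%} busy -> ${effective:.2f} per useful hour")
```

Dropping from 95 percent to 60 percent utilisation inflates the effective cost of every useful hour by more than half, which is why storage stalls and checkpoint delays show up directly in the cost story.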
These announcements did not happen in a vacuum. Google is openly building around the rise of agents. In its Next keynote material, the company described Gemini Enterprise Agent Platform as a comprehensive system to build, scale, govern, and optimise agents, and paired that platform story directly with its new infrastructure story. The company positioned TPU 8i as a system for cost-effective, near-zero latency inference for agentic workloads, and said the platform brings together tools for agent design, orchestration, registry, identity, gateway, and observability. Nvidia matched that framing, saying NVIDIA Nemotron 3 Super is available on Gemini Enterprise Agent Platform and that Google Cloud and Nvidia are adding managed reinforcement learning support built with NeMo RL. This tells you the real destination. The companies are not only trying to support chatbots. They are preparing for many-task systems that keep working in the background, make decisions, call tools, and generate more inference demand than ordinary chat ever did.
Another important part of the announcement sits outside pure performance. Google Distributed Cloud now offers the latest Gemini Flash models in preview on NVIDIA Blackwell and Blackwell Ultra platforms for connected customers, aimed at organisations that need frontier models closer to sensitive data. Google says this is about data sovereignty and secure deployment within an organisation’s own perimeter. Nvidia added that Gemini on Google Distributed Cloud lets customers bring Google’s frontier models wherever their most sensitive data resides, and said Confidential G4 VMs with NVIDIA RTX PRO 6000 Blackwell GPUs are in preview for multi-tenant environments. Nvidia also described that as the first confidential computing offering of NVIDIA Blackwell GPUs in the cloud. That matters because cost is only one barrier to AI adoption. Trust, compliance, and security are the others. Lower inference costs help, but many big organisations will still hesitate unless they believe the models can run where the data needs to stay.
There is another way to read all of this. Google and Nvidia are trying to prevent a future where inference becomes the choke point that slows AI adoption. As more companies move from pilot projects to full production systems, the economics can get ugly fast. A flashy demo can be tolerated even if it is expensive. A real product used by customers all day cannot. Enterprises want AI that is fast, predictable, secure, and cost-controlled. If the cost per request stays too high, or if latency jumps under pressure, or if energy use balloons, then many of the grand claims around agents and AI transformation start running into ordinary business reality. This is why Google keeps using phrases like cost-effective, unified, and scalable, and why Nvidia is talking about token throughput per megawatt rather than only raw performance. They both know the market is moving from experimental wonder to operational discipline.
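Nvidia's choice of metric is worth unpacking. Throughput per megawatt normalises performance by power draw, which converts directly into an energy cost per token. The sketch below uses an assumed throughput figure and an assumed electricity price purely for illustration; neither is a vendor specification.

```python
# Illustrative sketch: translating "tokens per megawatt" into energy cost.
# Both inputs are assumptions for illustration, not vendor specifications.

tokens_per_sec_per_mw = 2_000_000   # assumed sustained throughput per megawatt
power_price_per_mwh = 80.0          # assumed electricity price, $ per MWh

tokens_per_mwh = tokens_per_sec_per_mw * 3600  # one MWh = one MW for one hour
energy_cost_per_m_tokens = power_price_per_mwh / tokens_per_mwh * 1_000_000

print(f"Energy cost:      ${energy_cost_per_m_tokens:.4f} per million tokens")
# A 10x gain in throughput per megawatt, as claimed for A5X,
# would cut this energy component by the same factor.
print(f"After a 10x gain: ${energy_cost_per_m_tokens / 10:.4f} per million tokens")
```

The absolute numbers are small per token, but at data-centre scale the energy line is one of the few costs that never stops accruing, which is why it is the denominator Nvidia chose.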
For a while it looked like the AI industry would be defined mostly by model makers. That still matters, but this week’s announcements show something else. The new battleground is systems design. It is about who can combine model access, networking, storage, orchestration, security, and hardware into something enterprises can actually live on. Google’s keynote made that point when it described Gemini Enterprise as the connective layer between data, people, apps, and agents. Nvidia made the same point from the infrastructure side, describing a co-engineered full-stack platform that spans performance-optimised libraries, frameworks, cloud services, and production systems. What this really means is that the most important AI companies of the next phase may not be the ones that only produce the smartest models. They may be the ones that make intelligence durable, governable, and affordable enough to be used everywhere.
It would also be a mistake to think this is only about digital assistants or office software. Nvidia’s statement tied the partnership to industrial and physical AI as well, including digital twins, robotics simulation, and factory optimisation using Omniverse libraries and Isaac Sim on Google Cloud Marketplace. That may sound like a separate field, but it rests on the same economics. Physical AI workloads still need reasoning, simulation, visual processing, and large-scale inference. They still need cheap tokens, fast networking, secure environments, and right-sized hardware. In other words, the fight over inference cost is not just about chat responses on a screen. It is also about whether AI can move into warehouses, factories, vehicles, logistics systems, and machine-heavy environments without becoming too costly or too brittle to justify.
The ripple effects go well beyond Google and Nvidia. If these infrastructure changes really push inference costs down, then software companies can afford to make their AI products more persistent, more proactive, and more deeply embedded into daily workflows. Startups can experiment with richer agent behaviour without burning money as quickly. Large enterprises can justify more automation across support, search, compliance, analytics, cybersecurity, and creative work. Google’s keynote included examples from companies using Gemini Enterprise agents across finance, retail, manufacturing, travel, insurance, consumer goods, and healthcare-adjacent work. Nvidia also pointed to customers such as Snap and Schrödinger using the joint platform to cut the cost of large-scale data processing and accelerate scientific workloads. The pattern is clear. Cheaper inference does not just improve one feature. It expands the number of places where AI becomes economically reasonable.
The next phase of AI will likely be shaped less by headline-grabbing demonstrations and more by invisible improvements in cost, latency, and orchestration. That is what makes this Nvidia and Google moment worth watching. Google has now built a clearer split between training and inference with TPU 8t and TPU 8i, while also deepening its partnership with Nvidia through Vera Rubin-based A5X systems, Virgo networking, and secure deployment options across public and distributed cloud. Nvidia, for its part, is aligning itself not just with training clusters but with the full operational life of AI, from inference to agents to physical systems. If these claims hold up in practice, the result could be simple but powerful. AI stops feeling like a premium extra and starts feeling more like infrastructure. When that happens, the companies that lowered the cost of running intelligence may end up mattering just as much as the companies that first made it impressive.