NVIDIA CEO Jensen Huang sees agentic AI and AI-enabled robotics as a “multi-trillion dollar opportunity”
LAS VEGAS—Not unexpectedly, NVIDIA pretty well stole the show at CES this week. CEO Jensen Huang, following an introduction wherein Consumer Technology Association President Gary Shapiro didn’t once pronounce “NVIDIA” correctly, gave a tour de force keynote to a packed house at the Mandalay Bay. Huang essentially charted a course from AlexNet to perception artificial intelligence (AI) to generative AI (gen AI) to agentic AI and, in the future, physical AI.
What does that mean? AlexNet is a convolutional neural network (CNN) launched in 2012 that basically ushered in GPU-accelerated computer vision for image recognition at scale. Perception AI refers to AI systems that can “see” and interpret multi-modal information: images, audio, sensor data, and so on. We all (hopefully) know gen AI, which is firmly established today and making sweeping commercial impacts in both the consumer and enterprise segments. According to Huang, we’re now on the cusp of “agentic AI. AIs that can perceive, reason, plan and act. And then the next phase…physical AI.” More on that later. But, big picture, from AlexNet to today, “Every single layer of the technology stack has been completely changed—an incredible transformation in just 12 years,” Huang said.
The three scaling laws of AI
“The industry is chasing and racing to scale artificial intelligence, and the scaling law is a powerful model,” Huang said. “It’s an empirical law that has been observed and demonstrated by researchers and industry over several generations. The scaling law says that the more data you have, the training data that you have, the larger model that you have, and the more compute you apply to it, therefore the more effective or the more capable your model will become. The scaling law continues.”
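Huang doesn’t write the law down, but a commonly cited empirical form from the research literature (not from the keynote) expresses training loss as a function of parameter count N and training-token count D, roughly L(N, D) ≈ E + A/N^α + B/D^β: loss falls predictably as model size and data grow, which is why throwing more compute at training keeps buying capability.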
But two other scaling laws have emerged, he said. The post-training scaling law uses reinforcement learning and human feedback to help AI refine its skills. This is where domain-specific data can be used to fine-tune a large AI model in service of highly specialized tasks. “It’s essentially like having a mentor or having a coach give you feedback after you’re done going to school.” The third scaling law is test-time scaling, or reasoning. “When you’re using the AI, the AI has the ability to apply a different resource allocation. Instead of improving its parameters, now it’s focused on deciding how much computation to use to produce the answers it wants to produce.” Here, instead of outputting a one-shot, inference-based answer, the AI breaks a problem down into multiple steps, generates multiple ideas or responses, then evaluates them. “Test-time scaling has proven to be incredibly effective,” Huang said.
He continued: “All of these systems are going [through] this journey step by step by step…The amount of computation that we need is, of course, incredible…Society has the ability to scale the amount of computation to produce more and more novel, and better, intelligence. Intelligence, of course, is the most valuable asset that we have and it can be applied to solve a lot of very challenging problems.” As such, the three AI scaling laws are driving enormous demand for NVIDIA GPU-accelerated computing and associated solutions.
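To make the test-time scaling idea concrete, here is a minimal, hypothetical sketch in Python of the best-of-N flavor of the technique: instead of returning a single one-shot answer, the system spends extra inference-time compute generating several candidate responses and scoring them before answering. The generate_candidate and score_candidate functions are stand-ins for a real model and a real verifier, not anything NVIDIA announced.

```python
import random

def generate_candidate(prompt: str) -> str:
    """Stand-in for sampling one reasoning path / answer from a model."""
    return f"candidate answer {random.randint(0, 9999)} for: {prompt}"

def score_candidate(candidate: str) -> float:
    """Stand-in for a verifier or reward model that rates an answer."""
    return random.random()

def answer_with_test_time_scaling(prompt: str, n_samples: int = 8) -> str:
    """Spend more compute at inference: sample N candidates, keep the best-scoring one."""
    candidates = [generate_candidate(prompt) for _ in range(n_samples)]
    return max(candidates, key=score_candidate)

print(answer_with_test_time_scaling("Plan a three-leg shipping route."))
```

Raising n_samples is the knob Huang is describing: the model’s parameters don’t change, only how much computation it spends deciding on an answer.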
Huang: “One of the most important things that’s happening in the world of enterprise is agentic AI”
“Agentic AI basically is a perfect example of test-time scaling. It’s a system of models. Some of it is understanding, interacting with the customer, interacting with the user, some of it is maybe retrieving information from storage, a semantic AI system like a RAG, maybe it’s going onto the internet, maybe it’s studying a PDF file…It might be using tools. It might be using a calculator, it might be using a generative AI to generate charts and such. And it’s iterative.” Agentic AI is empowered, by the user, to take a problem and solve it using whatever means it has access to; it goes beyond answering questions to understanding your intent and delivering an outcome, because you’ve given it the agency to do that.
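For readers who want to picture what “a system of models” that “might be using tools” and “it’s iterative” looks like in practice, here is a deliberately simplified, hypothetical agent loop in Python. The tool functions and the decide_next_step planner are stand-ins for the model calls, retrieval systems and calculators Huang lists; none of this reflects NVIDIA’s actual implementation.

```python
def calculator(expression: str) -> str:
    """Tool: evaluate simple arithmetic (stand-in for a real calculator tool)."""
    return str(eval(expression, {"__builtins__": {}}, {}))

def retrieve(query: str) -> str:
    """Tool: stand-in for retrieval-augmented generation (RAG) over documents."""
    return f"[retrieved passage most relevant to '{query}']"

TOOLS = {"calculator": calculator, "retrieve": retrieve}

def decide_next_step(goal: str, history: list) -> tuple:
    """Stand-in for the planning model: pick a tool or decide to finish."""
    if not history:
        return ("retrieve", goal)          # step 1: gather context
    if len(history) == 1:
        return ("calculator", "12 * 30")   # step 2: do some math (hypothetical)
    return ("finish", None)                # enough iterations: produce the outcome

def run_agent(goal: str) -> str:
    """Iterate: plan, call a tool, observe the result, repeat, then deliver an outcome."""
    history = []
    while True:
        action, arg = decide_next_step(goal, history)
        if action == "finish":
            return f"Outcome for '{goal}' using observations: {history}"
        observation = TOOLS[action](arg)
        history.append((action, observation))

print(run_agent("Estimate monthly cost of 12 GPU instances at $30/day."))
```

The point of the sketch is the loop: the agent keeps planning, calling tools and observing results until it judges the goal met, which is what separates it from a single question-and-answer exchange.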
Huang said NVIDIA’s go-to-market strategy for agentic AI is to work with leading IT vendors (he name-checked Cadence, SAP and ServiceNow) to integrate his company’s technologies into partner offerings. At CES, he announced new NVIDIA AI Blueprints meant to facilitate customization of agents, “knowledge robots,” he called them, that can reason, plan and take actions. The blueprints integrate the NVIDIA AI Enterprise software platform, including NVIDIA NIM AI microservices and NVIDIA NeMo, an end-to-end gen AI model development platform. To give an example, NVIDIA announced a blueprint that converts PDFs to podcasts. Read the details here.
Big picture, “In the future these AI agents are essentially digital workforce that are working alongside your employees, doing things on your behalf.” Agents will be onboarded in a similar fashion to how human talent is onboarded: trained on company-specific vernacular, processes, ways of working and so on. “In a lot of ways, the IT department of every company is going to be the HR department of AI agents in the future. Today they manage to maintain a bunch of software…In the future they’ll maintain, nurture, onboard and improve a whole bunch of digital agents and provision them to the companies to use.”
Embodied cognition and physical AI
Before getting into NVIDIA’s approach to physical AI, let’s talk about embodied cognition. Embodied cognition is a theory that suggests that a body having sensory experiences and interacting with a physical environment is necessary to facilitate learning, thinking and reasoning. For AI, that means an AI system needs a physical form to truly develop the ability to think and act—to put it another way, for AI to truly be agentic, it needs physicality. See Rodney Brooks’ “Intelligence without representation,” The Embodied Mind: Cognitive Science and Human Experience by Eleanor Rosch, Evan Thompson and Francisco Varela, and Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Thought by Mark Johnson and George Lakoff.
OK. Back to Huang’s CES keynote: “Imagine, whereas your large language model, you give it your context, your prompt, on the left, and it generates tokens one at a time to produce the output. That’s basically how it works.” But also imagine that, instead of your prompt being a query that can be answered through an expansive web search summarized by a gen AI tool, your prompt is telling a robot to go pick up a box. In this case tokens don’t produce text; they produce action. “That,” Huang said, “what I just described, is a very sensible thing for the future of robotics.”
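To underline the analogy, here is a toy sketch (mine, not Huang’s) of the idea that the same autoregressive loop which emits text tokens one at a time can, with a different vocabulary, emit action tokens for a robot. The hard-coded next_token function stands in for a trained model; it exists only to show that the loop structure is identical.

```python
# Toy autoregressive loop: same decoding structure, two different token vocabularies.
TEXT_VOCAB = ["The", "box", "is", "on", "the", "shelf", "<end>"]
ACTION_VOCAB = ["move_to(shelf)", "open_gripper", "grasp(box)", "lift",
                "move_to(user)", "release", "<end>"]

def next_token(prompt: str, generated: list, vocab: list) -> str:
    """Stand-in for a trained model's next-token prediction: just walks the vocabulary."""
    return vocab[len(generated)] if len(generated) < len(vocab) else "<end>"

def generate(prompt: str, vocab: list) -> list:
    """Generate tokens one at a time until the end-of-sequence token."""
    out = []
    while True:
        tok = next_token(prompt, out, vocab)
        if tok == "<end>":
            return out
        out.append(tok)

print(generate("Where is the box?", TEXT_VOCAB))      # tokens that form text
print(generate("Go pick up the box.", ACTION_VOCAB))  # tokens that name robot actions
```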
To extend the comparison: where a large language model uses text, semantic relationships, RAG and other techniques to output text, the model an AI-enabled robot would rely on would need an understanding of physical dynamics like gravity, friction, inertia, geometric and spatial relationships, cause and effect, and object permanence, among other tangible realities. “All of these types of intuitive understanding that we know, most models today have a very hard time with. We need a world foundation model.”
And with that casual setup for what’s a deeply philosophical, technological paradigm shift, Huang announced NVIDIA Cosmos, a platform for physical AI development using a family of world foundation models, described by the company as “neural networks that can predict and generate physics-aware videos of the future state of a virtual environment…to help developers build next-generation robots and autonomous vehicles.”
Cosmos is trained on 20 million hours of video focused on dynamic, physical things—nature, humans walking, hands manipulating objects…“It’s really about teaching the AI not about generating creative content, but teaching the AI to understand the physical world. With this physical AI, there are many downstream things we can do as a result.” In particular, robots. There’s also a sort of Doctor Strange, multiverse angle here that I’ll have to cover some other time. Huang also noted that Cosmos will really shine when connected to NVIDIA Omniverse, to use 3D outputs and diffusion models to generate synthetic video data that, in turn, can be used to train physical AI models. The combination, he said, “gives you a physically simulated, a physically grounded, multiverse generator.” Read more here. And also here.
“The ChatGPT moment for general robotics is just around the corner,” Huang said. “And in fact, all of the enabling technologies that I’ve been talking about is going to make it possible for us in the next several years to see very rapid breakthroughs, surprising breakthroughs, in general robotics.”
There was a lot more in Huang’s keynote that I’ll endeavor to work through but, suffice to say, I’ve been to a lot of CESs, CES keynotes and CES press conferences, and this is the first time I thought to myself, “Holy shit.”