NVIDIA Research Unlocks Advanced Grasping, Smarter Autonomous Driving and Agent Training at Scale

NVIDIA Blog · Isha Salian · 2026-06-03

NVIDIA Research presents three CVPR 2026 papers: GraspGen-X, the first foundation model for zero-shot robotic grasping, trained on 2 billion simulated grasps; LCDrive, which replaces text-based chain-of-thought reasoning in autonomous vehicles with latent representations that use roughly half the tokens; and NitroGen, an embodied agent foundation model trained across more than 1,000 video games.

Extraction

Topics: robotics, physical-ai, autonomous-vehicles, foundation-models, embodied-ai

Claims

GraspGen-X is the first foundation model for zero-shot grasping, eliminating per-gripper retraining by training on 2 billion simulated grasps across thousands of object shapes and synthetic gripper configurations.
LCDrive achieves trajectory quality comparable to that of text-based chain-of-thought reasoning in autonomous vehicles while using roughly half the tokens, because it reasons in a compressed latent space.
NitroGen, trained across 1,000-plus games and 40,000 hours of interaction, improves embodied agent performance by up to 52% over prior state-of-the-art in low-data conditions.
Training at scale on simulated or synthetic data is the common thread enabling generalization across diverse real-world physical AI applications.

Key quotes

Like a large language model that can apply its understanding of language to a new task without retraining, GraspGen-X applies its understanding of geometry and contact to any robotic gripper it encounters.

Instead of generating human-readable reasoning steps, the system thinks in a compact latent space — states that capture spatial information rather than producing text.

Trained across more than 1,000 games and 40,000 hours of interaction using a model based on GR00T, the resulting agents learn to generalize across environments.