Model Overview

Description:

NVIDIA Isaac GR00T N1.7 is an open foundation model for generalized humanoid robot reasoning and skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. Developers and researchers can post-train GR00T N1.7 with real or synthetic data for their specific humanoid robot or task.

Isaac GR00T N1.7 is the medium-sized version of our model. It is built on pre-trained vision and language encoders and uses a flow-matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.

A detailed description of the Isaac GR00T N1.X architecture is provided in the GR00T N1 White Paper (https://arxiv.org/abs/2503.14734).

This model is ready for commercial/non-commercial use.

Model Developer: NVIDIA

Model Versions

The Isaac GR00T N1.7 model family includes the following 4 models:

GR00T N1.7 – SimplerEnv Bridge

Description
N1.7 post-trained model using the Bridge Dataset in SimplerEnv.

Post-Training Data
https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot

Dataset Summary
A LeRobot-format conversion of BridgeData V2, originally containing 60,096 trajectories of robot manipulation across 24 environments.

GR00T N1.7 – SimplerEnv Fractal

Description
N1.7 post-trained model using the Fractal Dataset in SimplerEnv.

Post-Training Data
https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot

Dataset Summary
A LeRobot-format conversion of the Fractal (RT-1) dataset, originally containing approximately 87,000 teleoperated Google Robot trajectories of manipulation tasks collected in office kitchen environments.

GR00T N1.7 – Droid

Description
N1.7 post-trained model using the DROID Dataset.

Post-Training Data
https://droid-dataset.github.io/

Dataset Summary
A large-scale "in-the-wild" robot manipulation dataset with approximately 76,000 demonstration trajectories (~350 hours) of interaction data, collected across 564 distinct scenes in 52 buildings, covering 86 manipulation tasks from natural-language instructions.

GR00T N1.7 – LIBERO

Description
N1.7 post-trained model using the LIBERO Dataset.

Post-Training Data
https://github.com/Lifelong-Robot-Learning/LIBERO

Dataset Summary
A benchmark for lifelong robot learning, providing 130 language-conditioned manipulation tasks grouped into multiple task suites.
Includes human-teleoperated demonstrations designed to evaluate knowledge transfer and continual learning in robotic agents.
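
The SimplerEnv post-training datasets above are distributed in LeRobot format. As a minimal sketch of how such a dataset can be inspected, assuming the lerobot library's LeRobotDataset loader (import paths vary across lerobot releases, so treat the path below as an assumption):

```python
# Minimal sketch: inspecting a LeRobot-format post-training dataset.
# The import path follows recent lerobot releases and may need adjusting.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Pull the BridgeData V2 conversion from the Hugging Face Hub.
dataset = LeRobotDataset("IPEC-COMMUNITY/bridge_orig_lerobot")

print(dataset.num_episodes)  # number of recorded trajectories
sample = dataset[0]          # one timestep: camera frames, state, action, task
print(sample.keys())
```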

License

This model is released under the NVIDIA Open Model License Agreement.

Deployment Geography:

Global

Use Case:

  • Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development.
  • Developers: Integrate and customize AI for various robotic applications.
  • Startups & Companies: Accelerate robotics development and reduce training costs.

Release Date:

Computational Load

Estimated Energy and Emissions for Model Training:

  • Total energy: 64 GB200 nodes × 4 GPUs per node × 1,200 W × 0.001 kW/W × 0.8 × 120 hours × 1.4 ≈ 41,288 kWh
  • Total emissions: 410.5 gCO2e/kWh × 41,288 kWh × 10⁻⁶ t/g ≈ 16.949 tCO2e
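
The arithmetic behind these figures can be reproduced directly; interpreting the 0.8 and 1.4 factors as utilization and overhead multipliers is our assumption:

```python
# Reproducing the training energy and emissions estimate above.
nodes, gpus_per_node = 64, 4
watts_per_gpu = 1200               # per-GPU power draw used in the card
hours = 120
utilization, overhead = 0.8, 1.4   # factors from the card's formula (assumed meaning)

kwh = nodes * gpus_per_node * watts_per_gpu * 0.001 * utilization * hours * overhead
print(round(kwh))                  # -> 41288 kWh

grams_co2_per_kwh = 410.5          # assumed grid carbon intensity
tco2e = grams_co2_per_kwh * kwh * 1e-6
print(round(tco2e, 3))             # -> 16.949 tCO2e
```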

Model Architecture:

In GR00T N1.7, the vision-language model (VLM) backbone is Cosmos-Reason2-2B.

Network Architecture:

The schematic diagram is shown in the illustration above.

  • Red, Green, Blue (RGB) camera frames are processed through a pre-trained vision transformer (SigLIP 2).
  • Text is encoded by a pre-trained transformer (T5).
  • Robot proprioception is encoded by a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable maximum length before being fed into the MLP.
  • Actions are encoded, and velocity predictions decoded, by an MLP, one per unique embodiment.
  • The flow-matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion-step conditioning is implemented using adaptive layer normalization (AdaLN).
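
For a concrete picture of the action head, here is a minimal PyTorch sketch of a DiT block with AdaLN step conditioning of the kind described above. Dimensions, names, and the block layout are illustrative assumptions, not the actual GR00T N1.7 implementation.

```python
import torch
import torch.nn as nn

class AdaLNDiTBlock(nn.Module):
    """One diffusion-transformer block with adaptive layernorm (AdaLN)
    conditioning on the flow-matching step. Illustrative sketch only."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # AdaLN: the step embedding regresses per-block scale/shift/gate
        # parameters (6 * dim values) that modulate the layernorms.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, step_emb: torch.Tensor) -> torch.Tensor:
        s1, b1, g1, s2, b2, g2 = self.ada(step_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

# Noisy action-chunk tokens, conditioned on a flow-step embedding:
block = AdaLNDiTBlock(dim=512, heads=8)
actions = torch.randn(2, 16, 512)  # (batch, action horizon, hidden dim)
step = torch.randn(2, 512)         # flow-matching step embedding
out = block(actions, step)         # -> (2, 16, 512)
```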

Number of Model Parameters: 3,000,000,000

Input:

Input Type(s):

  • Vision: Image Frames
  • State: Robot Proprioception
  • Language Instruction: Text
  • Embodiment ID: Integer

Input Format:

  • Vision: Variable number of uint8 image frames, coming from robot cameras
  • State: Floating point
  • Language Instruction: String
  • Embodiment ID: Integer indicating which of the training embodiments is observed

Input Parameters:

  • Vision: Two-Dimensional (2D) - Red, Green, Blue (RGB)
  • State: One-Dimensional (1D) - Floating-point vector
  • Language Instruction: One-Dimensional (1D) - String
  • Embodiment ID: One-Dimensional (1D) - Integer

Output:

Output Type(s): Actions

Output Format: Continuous-value vectors

Output Parameters: Two-Dimensional (2D)

Other Properties Related to Output: The continuous-value vectors correspond to different motor controls on the robot; their dimensionality depends on the degrees of freedom of the robot embodiment.
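
To make the input/output contract concrete, the sketch below assembles a hypothetical observation payload and notes the expected action-chunk shape. All field names and the `policy` object are illustrative assumptions, not the actual GR00T inference API.

```python
import numpy as np

# Hypothetical observation payload matching the input spec above.
observation = {
    "video": np.zeros((2, 480, 640, 3), dtype=np.uint8),  # variable number of RGB frames
    "state": np.zeros(44, dtype=np.float32),              # padded proprioception vector
    "language_instruction": "pick up the red cup",
    "embodiment_id": 3,                                   # one of the training embodiments
}

# A post-trained policy would return a 2D action chunk:
#   actions = policy.get_action(observation)
#   actions.shape -> (action_horizon, action_dim)
# where action_dim matches the degrees of freedom of the target embodiment.
```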

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s): PyTorch

Supported Hardware Microarchitecture Compatibility: All of the below:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Ada Lovelace

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version

GR00T N1.7 EA

Training and Evaluation Datasets:

The total size (in number of data points): 21.6 million
Total number of datasets: 13

Training Dataset:

GR00T Pretraining Data

Data Collection Method by dataset: Hybrid: Human, Robot, Simulated.

Labeling Method by dataset: Hybrid: Human, Automated.

Properties:

  • Cross-embodiment: Data collected on various robot embodiments
  • Sensor types: RGB camera, robot proprioception, robot actuator data

Evaluation:

We evaluate in both simulation and real robot benchmarks, as defined in the White Paper (https://arxiv.org/abs/2503.14734).

Data Collection Method by dataset: Hybrid: Human, Robot, Simulated.

Labeling Method by dataset: Hybrid: Human, Automated.

  • Sim evaluation benchmarks for upper-body control:
    • 9 DexMG Whitepaper tasks
    • 24 RoboCasa simulated mobile manipulator tasks
    • 24 Digital Cousin simulated GR-1 humanoid manipulation tasks
    • For sim, we automatically measure the success rate of each manipulation behavior.
  • Real-robot benchmarks:
    • Grocery packing task
    • Novel objects (unseen in the training data)
    • Industrial multi-robot coordination with handoffs
    • Evaluated by human observers in the lab

Inference:

Engine: PyTorch

Test Hardware: A6000

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
