RoboBrain 2.0 is an open-source embodied AI model that improves a robot's abilities in multi-agent task planning, spatial reasoning, and closed-loop execution. By combining visual, language, and structured scene-graph information, it supports interactive reasoning, spatial awareness, temporal awareness, and scene reasoning. The model performs strongly on multiple embodied-AI benchmarks, demonstrating reasoning and planning in complex environments. The project provides code examples for simple inference tasks, including image understanding, affordance prediction, trajectory prediction, and grounding/pointing prediction, as well as usage for navigation tasks.
It can decompose complex tasks. For example, for "I'm hungry and want to order a regular burger", the planned steps are: grasp (bottom bun) → place on (basket) → grasp (lettuce slice) → place on (bun) → … → place on (cheese) → move to (kitchen table) → grasp (basket) → move to (customer table).
It can monitor task-execution status in real time. For instance, given "please hand me an orange and a knife", the system first determines that both the orange and the knife are on the kitchen table, and then monitors grasp status, arrival status, task completion (e.g., 80% complete), and object availability to make sure the task is carried out successfully.
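As a purely illustrative sketch (not part of the RoboBrain 2.0 API), the decomposed plan and the monitored execution state from the two examples above could be represented as simple Python structures:

```python
# Hypothetical illustration only: a decomposed plan as (action, target) steps
# and a monitored execution state, mirroring the examples above.
burger_plan = [
    ("grasp", "bottom bun"),
    ("place_on", "basket"),
    ("grasp", "lettuce slice"),
    ("place_on", "bun"),
    # ... intermediate steps omitted ...
    ("place_on", "cheese"),
    ("move_to", "kitchen table"),
    ("grasp", "basket"),
    ("move_to", "customer table"),
]

execution_state = {
    "target_objects": {"orange": "kitchen table", "knife": "kitchen table"},
    "grasp_status": "holding_orange",   # hypothetical status value
    "arrival_status": "en_route",       # hypothetical status value
    "task_completion": 0.8,             # e.g., 80% complete
    "objects_available": True,
}
```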
Pointing: it can precisely localize points in space from an instruction. For example, for "please point out the free space between the bathtub and the toilet", the system outputs concrete coordinate points.
Bounding-box grounding: given an instruction such as "walk to the container and provide the trajectory", it can accurately determine the bounding box of the target object.
It can estimate future trajectories. For example, for "pick up the knife and provide the trajectory", the system generates a sequence of coordinate points that form a motion trajectory from the current position to the target position.
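A hedged sketch of how pointing and trajectory queries might be issued through the same SimpleInference helper used in the Quick Start examples below; the task identifiers "pointing" and "trajectory" and the demo image paths are assumptions and may differ in the actual repository:

```python
from inference import SimpleInference

model = SimpleInference("BAAI/RoboBrain2.0-7B")

# Assumed task name for point localization; the expected answer is a list of points.
pointing_pred = model.inference(
    "Point out the free space between the bathtub and the toilet",
    "./assets/demo/pointing.jpg",      # hypothetical demo image path
    task="pointing",                   # assumed task identifier
    plot=True,
    enable_thinking=True,
)

# Assumed task name for trajectory prediction; the expected answer is a sequence of waypoints.
trajectory_pred = model.inference(
    "Pick up the knife and provide the trajectory",
    "./assets/demo/trajectory.jpg",    # hypothetical demo image path
    task="trajectory",                 # assumed task identifier
    plot=True,
    enable_thinking=True,
)
```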
It reasons over a structured scene memory that is built and updated in real time. In a kitchen scene, for example, the memory records the kitchen table's type, the objects on it, the robot itself, and so on, helping the robot better understand and execute tasks.
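A minimal, purely illustrative sketch of what such a structured scene memory could look like for the kitchen example; the schema is an assumption, not the model's actual internal representation:

```python
# Illustrative scene-memory entry for the kitchen example; the schema is hypothetical.
scene_memory = {
    "kitchen_table": {
        "type": "table",
        "objects_on_top": ["orange", "knife", "basket"],
    },
    "robot": {
        "location": "near kitchen_table",
        "holding": None,
    },
}
```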
The RoboBrain 2.0 architecture supports multi-image, long-video, and high-resolution visual inputs, along with complex task instructions and structured scene graphs on the language side. Visual inputs are processed by a vision encoder and an MLP projector, text inputs are tokenized into a unified token stream, and all inputs are fed into an LLM decoder that performs long-chain-of-thought reasoning and outputs structured plans, spatial relations, and relative and absolute coordinates.
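To make this data flow concrete, here is a hedged, toy-level sketch with stand-in functions (all names are placeholders for illustration, not the real implementation):

```python
from typing import Any, Dict, List

# Toy stand-ins for the real modules, only to show the order of the data flow.
def vision_encoder(frames: List[str]) -> List[str]:
    return [f"<vis:{f}>" for f in frames]          # visual features (toy)

def mlp_projector(features: List[str]) -> List[str]:
    return [f"<proj:{x}>" for x in features]       # projected visual tokens (toy)

def tokenize(text: str) -> List[str]:
    return text.split()                            # unified token stream (toy)

def llm_decoder(tokens: List[str]) -> Dict[str, Any]:
    # In the real model this is long chain-of-thought decoding over all tokens;
    # here we only return an empty structured output for illustration.
    return {"plan": [], "spatial_relations": [], "coordinates": [], "n_tokens": len(tokens)}

def forward(frames: List[str], instruction: str, scene_graph: str) -> Dict[str, Any]:
    visual_tokens = mlp_projector(vision_encoder(frames))       # vision encoder -> MLP projector
    text_tokens = tokenize(instruction + " " + scene_graph)     # instruction + scene graph
    return llm_decoder(visual_tokens + text_tokens)             # single LLM decoder over all tokens
```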
| Model | Checkpoint | Description |
| --- | --- | --- |
| RoboBrain 2.0 7B | BAAI/RoboBrain2.0-7B | 7B-parameter version of RoboBrain 2.0 |
| RoboBrain 2.0 32B | BAAI/RoboBrain2.0-32B | 32B-parameter version of RoboBrain 2.0 (coming soon) |
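The checkpoints are hosted on Hugging Face under the IDs in the table; a minimal sketch for downloading one locally with huggingface_hub:

```python
from huggingface_hub import snapshot_download

# Download the 7B checkpoint to a local cache directory (repo ID taken from the table above).
local_dir = snapshot_download("BAAI/RoboBrain2.0-7B")
print(f"Checkpoint downloaded to: {local_dir}")
```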
1. Clone the repository:

```bash
git clone https://github.com/FlagOpen/RoboBrain2.0.git
cd RoboBrain2.0
```
2. Create and activate the conda environment:

```bash
conda create -n robobrain2 python=3.10
conda activate robobrain2
pip install -r requirements.txt
```
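Optionally, a quick sanity check that the environment sees PyTorch and a GPU (this assumes requirements.txt installs PyTorch):

```python
import torch

# Verify that PyTorch is installed and a CUDA device is visible.
print(torch.__version__, torch.cuda.is_available())
```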
3. Run a simple inference. General task with thinking mode disabled:

```python
from inference import SimpleInference

# Load the 7B checkpoint through the repository's inference helper.
model = SimpleInference("BAAI/RoboBrain2.0-7B")

prompt = "What is shown in this image?"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"

pred = model.inference(prompt, image, task="general", enable_thinking=False, do_sample=True)
print(f"Prediction:\n{pred}")
```
Example prediction:

```json
{
    "thinking": "",
    "answer": "Two cats sleeping side by side on a couch."
}
```
The same general task with thinking mode enabled:

```python
from inference import SimpleInference

model = SimpleInference("BAAI/RoboBrain2.0-7B")

prompt = "What is shown in this image?"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"

# enable_thinking=True also returns the model's chain-of-thought in the "thinking" field.
pred = model.inference(prompt, image, task="general", enable_thinking=True, do_sample=True)
print(f"Prediction:\n{pred}")
```
Example prediction:

```json
{
    "thinking": "Upon examining the visual input, I observe two cats resting comfortably on a pink blanket that covers a couch or sofa. The cats are lying side by side, with one on top of the other, indicating their relaxed state and possibly their close bond. Their positions suggest they feel safe and at ease in their environment.\n\nWith my advanced visual processing capabilities, I can identify various objects within this scene, such as the pink blanket beneath the cats and the couch they are lying on. Additionally, there appear to be remote controls nearby, potentially situated on or near the couch, which further confirms that this is an indoor setting where people relax and spend leisure time.",
    "answer": "The image shows two cats lying on a pink blanket on a couch."
}
```
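For downstream use, the "thinking" and "answer" fields can be separated. A small hedged sketch, assuming pred is returned either as a dict or as a JSON string shaped like the example above:

```python
import json

# pred may be a JSON string or an already-parsed dict; handle both (assumption).
result = json.loads(pred) if isinstance(pred, str) else pred
reasoning = result.get("thinking", "")
answer = result.get("answer", "")
print("Answer:", answer)
```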
Affordance prediction:

```python
from inference import SimpleInference

model = SimpleInference("BAAI/RoboBrain2.0-7B")

# Example: predict the graspable region for holding the cup.
prompt = "hold the cup"
image = "./assets/demo/affordance.jpg"

pred = model.inference(prompt, image, task="affordance", plot=True, enable_thinking=True, do_sample=True)
print(f"Prediction:\n{pred}")
```
Example prediction:

```json
{
    "thinking": "From the visual input, the object is recognized as a white ceramic cup with a handle on its side. It appears cylindrical with an open top and has sufficient height for a standard drinking cup. The handle is positioned to one side, which is crucial for grasping. The cup rests on a wooden surface, suggesting stability due to its weight and material solidity.\n\nMy end-effector is equipped with a gripper capable of securely engaging objects of this size and shape, specifically designed for cylindrical and handle-like surfaces. Given my capabilities, I can adjust the grip to accommodate the handle's size and curve. The gripper can easily access the handle area without disturbing the cup's balance on the flat surface.\n\nThe current task is to hold the cup, which necessitates securely gripping it by the handle or potentially enveloping the body if necessary. The cup's position on the table, within reach, allows me to approach from the left side toward the handle, ensuring optimal leverage for lifting.\n\nVerifying the handle's suitability, it seems sufficiently robust and unobstructed to enable a reliable grip. My sensors will ensure that the force applied through the gripper doesn't exceed the cup's weight and stability limits.\n\nTherefore, the cup's affordance area is [577, 224, 638, 310]. This is because the handle provides a clear and accessible spot for my gripper to engage securely, fulfilling the task requirement to hold the cup effectively.",
    "answer": "[577, 224, 638, 310]"
}
```
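The answer is a pixel-space bounding box [x1, y1, x2, y2]. A small sketch for visualizing it with Pillow, assuming the answer string follows the format shown above:

```python
import ast
from PIL import Image, ImageDraw

# Parse the "[x1, y1, x2, y2]" answer string from the example above (assumed format).
box = ast.literal_eval("[577, 224, 638, 310]")

# Draw the predicted affordance region on the input image.
image = Image.open("./assets/demo/affordance.jpg").convert("RGB")
draw = ImageDraw.Draw(image)
draw.rectangle(box, outline="red", width=3)
image.save("affordance_box.png")
```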
Training uses FlagScale, a distributed training framework developed by the framework R&D team at the Beijing Academy of Artificial Intelligence (BAAI). See QuickStart.md for the concrete steps to train the base instruction model or to fine-tune RoboBrain 2.0.
Evaluation uses FlagEvalMM, a flexible evaluation framework developed by the FlagEval team, for comprehensive multimodal model evaluation. The steps are:
1. Follow the instructions in FlagEvalMM for installation, configuration, and data preparation.
2. Run the evaluation command (example):
```bash
flagevalmm --tasks tasks/where2place/where2place.py \
    --exec model_zoo/vlm/api_model/model_adapter.py \
    --model BAAI/RoboBrain2.0-7B \
    --num-workers 8 \
    --output-dir ./results/RoboBrain2.0-7B \
    --backend vllm \
    --extra-args "--limit-mm-per-prompt image=18 --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code --mm-processor-kwargs '{\"max_dynamic_patch\":4}'"
```
RoboBrain2.0-32B achieves state-of-the-art performance on four key embodied-AI benchmarks: BLINK-Spatial, CV-Bench, EmbSpatial, and RefSpatial. It surpasses leading open-source models such as Qwen2.5-VL as well as closed-source models such as o4-mini, Gemini 2.5 Pro, and Claude Sonnet 4. On the challenging RefSpatial benchmark, RoboBrain2.0 shows an absolute improvement of more than 50%.