RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation

Best Paper Award & Best Poster Award, IROS 2025 RoDGE Workshop
Xiaoshuai Hao1, Yingbo Tang2,*, Lingfeng Zhang3, Yanbiao Ma4, Yunfeng Diao5,
Ziyu Jia2, Wenbo Ding3, Hangjun Ye1, Long Chen1,†
1Xiaomi EV    2Institute of Automation, Chinese Academy of Sciences
3Tsinghua Shenzhen International Graduate School, Tsinghua University
4Gaoling School of Artificial Intelligence, Renmin University of China
5School of Computer Science and Information Engineering, Hefei University of Technology

*Corresponding author    †Project leader

RoboAfford++ Poster

RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation

Yingbo Tang1,2*, Lingfeng Zhang3,*, Shuyi Zhang1,2, Yinuo Zhao4, Xiaoshuai Hao5,†
1Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3Shenzhen International Graduate School, Tsinghua University
4Beijing Institute of Technology
5Beijing Academy of Artificial Intelligence (BAAI)

*Co-first Authors    †Corresponding Author

This work introduces RoboAfford, a novel large-scale dataset designed to enhance object and spatial affordance learning in robot manipulation. It encompasses three key dimensions: object affordance recognition, object affordance prediction, and spatial affordance localization.

Abstract

Robot manipulation is a fundamental capability of embodied intelligence, enabling robots to interact effectively with the physical world. In manipulation tasks, predicting precise grasping positions and object placements is essential. Achieving this requires object recognition to localize the target object, prediction of object affordances for interaction, and prediction of spatial affordances for optimal arrangement. While Vision-Language Models (VLMs) provide insights for high-level task planning and scene understanding, they often struggle to predict precise action positions, such as functional grasp points and spatial placements. This limitation stems from the lack of object and spatial affordance annotations in their training datasets.

To address this gap, we introduce RoboAfford, a novel large-scale dataset designed to enhance object and spatial affordance learning in robot manipulation. Our dataset comprises 819,987 images paired with 1.9 million question-answering (QA) annotations, covering three critical tasks: object affordance recognition to identify objects based on their attributes and spatial relationships, object affordance prediction to pinpoint functional grasping parts, and spatial affordance localization to identify free space for object placement. Complementing this dataset, we propose RoboAfford-Eval, a comprehensive benchmark for assessing affordance-aware prediction in real-world scenarios, featuring 338 meticulously annotated samples across the same three tasks. Extensive experimental results reveal the deficiencies of existing VLMs in affordance learning, while fine-tuning on the RoboAfford dataset significantly enhances their affordance prediction capabilities for robot manipulation, validating the dataset's effectiveness. The dataset, benchmark, and evaluation code will be made publicly available to facilitate future research.
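For concreteness, a single QA annotation for each of the three tasks could look like the sketch below. The field names and the normalized-point convention are illustrative assumptions for this page, not the released schema; please refer to the dataset itself for the actual format.

# Hypothetical examples of QA annotations for the three RoboAfford tasks.
# Field names and the normalized (x, y) point convention are assumptions
# for illustration only.
samples = [
    {
        "task": "object_affordance_recognition",
        "image": "images/000123.jpg",
        "question": "Which object on the table can be used to cut the apple?",
        "answer": "The knife to the left of the cutting board.",
    },
    {
        "task": "object_affordance_prediction",
        "image": "images/000456.jpg",
        "question": "Point to where the mug should be grasped.",
        "answer": {"point": [0.62, 0.41]},  # normalized (x, y) on the handle
    },
    {
        "task": "spatial_affordance_localization",
        "image": "images/000789.jpg",
        "question": "Point to a free area on the shelf to place the book.",
        "answer": {"point": [0.18, 0.55]},  # normalized (x, y) in empty space
    },
]

for sample in samples:
    print(sample["task"], "->", sample["question"])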

Dataset Construction


Pipeline for constructing the RoboAfford dataset. We first discard images with densely repeated objects, and then generate question-answering pairs using human-designed templates or GPT-4o.
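A minimal sketch of this two-step pipeline is shown below. The input format, filtering threshold, question template, and the generate_with_gpt4o helper are assumptions made for illustration; they are not the paper's exact settings.

# Minimal sketch of the construction pipeline, assuming `annotations` maps an
# image path to a list of labeled objects with a category and a 2D point.
from collections import Counter

RECOGNITION_TEMPLATE = "Point to the {category} in the image."

def generate_with_gpt4o(image, objects):
    """Placeholder for prompting GPT-4o with the image and its object labels."""
    return []  # the real pipeline would return model-written QA pairs here

def build_qa_pairs(annotations, max_repeats=10):
    qa_pairs = []
    for image, objects in annotations.items():
        # Step 1: discard images with densely repeated objects,
        # which make referring expressions ambiguous.
        counts = Counter(obj["category"] for obj in objects)
        if counts and max(counts.values()) > max_repeats:
            continue
        # Step 2a: QA pairs from human-designed templates.
        for obj in objects:
            qa_pairs.append({
                "image": image,
                "question": RECOGNITION_TEMPLATE.format(category=obj["category"]),
                "answer": obj["point"],
            })
        # Step 2b: additional, more diverse QA pairs written by GPT-4o.
        qa_pairs.extend(generate_with_gpt4o(image, objects))
    return qa_pairs

if __name__ == "__main__":
    demo = {"img_001.jpg": [{"category": "mug", "point": [0.62, 0.41]}]}
    print(build_qa_pairs(demo))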

Comparison of Existing Affordance Datasets


Obj-Aff: Object Affordance. Spa-Aff: Spatial Affordance.

Methodology


Our RoboAfford-Qwen framework. We fine-tune the model on the RoboAfford dataset to enhance its object and spatial affordance capabilities. For robotic manipulation, depth information is used to lift the predicted 2D object and spatial affordance points into 3D coordinates, which are then converted to end-effector positions for execution.
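The sketch below illustrates this 2D-to-3D lifting step under a standard pinhole camera model. The intrinsics, hand-eye extrinsics, and the predicted point are placeholder values; the actual calibration and control interface depend on the robot setup and are not specified here.

# Lift a predicted 2D affordance point to a 3D end-effector target,
# assuming a pinhole camera model and known camera-to-base extrinsics.
import numpy as np

def pixel_to_base_frame(u, v, depth, K, T_base_cam):
    """Back-project pixel (u, v) with depth (meters) into the robot base frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection into the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    p_cam = np.array([x, y, depth, 1.0])
    # Transform into the robot base frame with the camera-to-base extrinsics.
    return (T_base_cam @ p_cam)[:3]

# Placeholder calibration and prediction (assumed values for illustration).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T_base_cam = np.eye(4)   # camera frame coincides with base frame in this toy example
u, v = 412, 287          # 2D affordance point predicted by the model
depth = 0.85             # depth (m) read from the aligned depth image

target_xyz = pixel_to_base_frame(u, v, depth, K, T_base_cam)
print("End-effector target position:", target_xyz)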

Experimental Results

Comparison results of various VLMs on the RoboAfford-Eval benchmark.


Real-world Experiments


Qualitative results of RoboAfford-Qwen, where cyan points indicate the object and spatial affordances.

Robot Manipulation


Results of deploying RoboAfford-Qwen to downstream robotic manipulation tasks.

Citation

RoboAfford++

@article{hao2025roboafford++,
  title={RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation},
  author={Hao, Xiaoshuai and Tang, Yingbo and Zhang, Lingfeng and Ma, Yanbiao and Diao, Yunfeng and Jia, Ziyu and Ding, Wenbo and Ye, Hangjun and Chen, Long},
  journal={arXiv preprint arXiv:2511.12436},
  year={2025}
}

RoboAfford

@inproceedings{tang2025roboafford,
  title={RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation},
  author={Tang, Yingbo and Zhang, Lingfeng and Zhang, Shuyi and Zhao, Yinuo and Hao, Xiaoshuai},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={12706--12713},
  year={2025}
}

License

The datasets and benchmarks are released under the Creative Commons Attribution 4.0 International License.