Robot manipulation is a fundamental capability of embodied intelligence, enabling robots to interact effectively with the physical world. In robot manipulation tasks, predicting precise grasping positions and object placements is essential. Achieving this requires recognizing objects to localize the target, predicting object affordances for interaction, and predicting spatial affordances for optimal arrangement. While Vision-Language Models (VLMs) provide useful capabilities for high-level task planning and scene understanding, they often struggle to predict precise action positions, such as functional grasp points and spatial placements. This limitation stems from the lack of object and spatial affordance annotations in their training data.
To address this gap, we introduce RoboAfford, a novel large-scale dataset designed to enhance object and spatial affordance learning in robot manipulation. Our dataset comprises 819,987 images paired with 1.9 million question-answering (QA) annotations, covering three critical tasks: object affordance recognition to identify objects based on their attributes and spatial relationships, object affordance prediction to pinpoint functional grasping parts, and spatial affordance localization to identify free space for placement. Complementing this dataset, we propose RoboAfford-Eval, a comprehensive benchmark for assessing affordance-aware prediction in real-world scenarios, featuring 338 meticulously annotated samples across the same three tasks. Extensive experimental results reveal the deficiencies of existing VLMs in affordance learning, while fine-tuning on the RoboAfford dataset significantly enhances their affordance prediction for robot manipulation, validating the dataset's effectiveness. The dataset, benchmark, and evaluation code will be made publicly available to facilitate future research.
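To make the three annotation types concrete, the sketch below shows how one QA sample per task might be organized; the field names, coordinates, and file paths are hypothetical illustrations, not the released schema.

# Hypothetical examples of the three RoboAfford annotation types; the actual
# released schema and coordinate conventions may differ.
object_affordance_recognition = {
    "image": "images/kitchen_0001.jpg",
    "question": "Which object on the counter can be used to cut the bread?",
    "answer": "the knife to the left of the cutting board",
}
object_affordance_prediction = {
    "image": "images/kitchen_0001.jpg",
    "question": "Point to the part of the mug that should be grasped.",
    "answer": {"point": [412, 287]},  # pixel coordinate of the handle
}
spatial_affordance_localization = {
    "image": "images/kitchen_0001.jpg",
    "question": "Point to a free area on the shelf where the mug can be placed.",
    "answer": {"point": [530, 146]},  # pixel coordinate of free space
}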
Pipeline for constructing the RoboAfford dataset. We first discard images containing densely repeated objects, and then generate question-answering pairs using human-designed templates or GPT-4o.
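As a rough illustration of this two-stage pipeline, the Python sketch below filters out images whose most frequent object category repeats too often and then delegates QA generation to template-based and GPT-4o-based generators; the helper callables, assumed sample format, and repetition threshold are assumptions, not details from the paper.

def count_repeated_instances(objects):
    # Placeholder: size of the most frequent object category in an image.
    counts = {}
    for obj in objects:
        counts[obj["category"]] = counts.get(obj["category"], 0) + 1
    return max(counts.values(), default=0)

def build_qa_pairs(samples, gen_from_template, gen_with_gpt4o, max_repeats=5):
    # samples: iterable of dicts with "image" and "objects" keys (assumed format).
    qa_pairs = []
    for sample in samples:
        # Stage 1: discard images with densely repeated objects, which make
        # referring expressions ambiguous.
        if count_repeated_instances(sample["objects"]) > max_repeats:
            continue
        # Stage 2: generate QA pairs from human-designed templates or GPT-4o.
        qa_pairs.extend(gen_from_template(sample))
        qa_pairs.extend(gen_with_gpt4o(sample))
    return qa_pairs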
Obj-Aff: Object Affordance. Spa-Aff: Spatial Affordance.
Our RoboAfford-Qwen framework. We fine-tune the model on the RoboAfford dataset to enhance its object and spatial affordance capabilities. For downstream robotic manipulation, we use depth information to transform the predicted 2D object and spatial affordance points into 3D coordinates, which are then converted to end-effector positions.
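The 2D-to-3D lifting step can be sketched with standard pinhole back-projection followed by a hand-eye transform; the intrinsics, extrinsics, and function name below are illustrative assumptions rather than the exact procedure used in the paper.

import numpy as np

def pixel_to_end_effector_target(u, v, depth_m, K, T_base_cam):
    # Back-project a predicted 2D affordance point (u, v) with its depth
    # (meters) into the camera frame using pinhole intrinsics K, then map it
    # into the robot base frame with the 4x4 camera-to-base extrinsic.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel + depth -> 3D point in the camera frame (homogeneous coordinates).
    p_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m,
                      1.0])
    # Camera frame -> robot base frame: this is the grasp or placement target
    # commanded to the end effector.
    return (T_base_cam @ p_cam)[:3]

# Example with assumed calibration values.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_base_cam = np.eye(4)  # replace with the real hand-eye calibration
target_xyz = pixel_to_end_effector_target(412, 287, 0.55, K, T_base_cam)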
Qualitative results of RoboAfford-Qwen, where cyan points indicate the object and spatial affordances.
Results of deploying RoboAfford-Qwen to downstream robotic manipulation tasks.
@article{hao2025roboafford++,
title={RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation},
author={Hao, Xiaoshuai and Tang, Yingbo and Zhang, Lingfeng and Ma, Yanbiao and Diao, Yunfeng and Jia, Ziyu and Ding, Wenbo and Ye, Hangjun and Chen, Long},
journal={arXiv preprint arXiv:2511.12436},
year={2025}
}
@inproceedings{tang2025roboafford,
title={RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation},
author={Tang, Yingbo and Zhang, Lingfeng and Zhang, Shuyi and Zhao, Yinuo and Hao, Xiaoshuai},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
pages={12706--12713},
year={2025}
}
The datasets and benchmarks are released under the Creative Commons Attribution 4.0 International License.