Brief description of the results achieved
In the PILLAR-Robots project, the ARC team is focused on developing advanced multimodal perception modules that address the needs of the three main use cases: agriculture, edutainment, and industrial workshops. Over the past months, we have built modules that enable robots to better perceive their surroundings and perform purposeful tasks within the considered scenarios.
In agriculture, post-harvesting tasks highlight the perception challenges robots face, as fruits and vegetables vary greatly in shape, size, and ripeness. To tackle this, we developed PLANTPose [1], a category-level 6D pose estimation framework that works directly on RGB images. Instead of relying on a detailed 3D model for every individual fruit, PLANTPose adapts a single base model to many unseen instances by predicting both their pose and their shape variations. To improve realism, we enrich the synthetic training data with Stable Diffusion, generating textures that capture the different ripeness stages of fruits. Tested on bananas, a crop that naturally shows large variations in shape and color, PLANTPose achieved high accuracy and significantly outperformed state-of-the-art baselines.
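The core idea of adapting one base model via predicted shape variations can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the PLANTPose implementation: it warps a category-level base shape with a 2x2x2 free-form deformation lattice (trilinear blending of predicted control-point offsets) and then applies the predicted rigid 6D pose.

```python
import numpy as np

def deform_and_pose(base_points, lattice_offsets, R, t):
    """Illustrative sketch (not the actual PLANTPose code):
    warp a base shape, normalised to the unit cube [0, 1]^3, with a
    2x2x2 deformation lattice, then apply a rigid pose (R, t)."""
    x, y, z = base_points[:, 0], base_points[:, 1], base_points[:, 2]
    deformed = base_points.copy()
    # blend the 8 control-point offsets with trilinear weights
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                w = ((x if i else 1 - x)
                     * (y if j else 1 - y)
                     * (z if k else 1 - z))
                deformed += w[:, None] * lattice_offsets[i, j, k]
    # rigid transform of the instance-specific shape into the camera frame
    return deformed @ R.T + t
```

With zero lattice offsets and an identity pose, the base shape is returned unchanged; in practice, a network would regress both `lattice_offsets` and `(R, t)` from the RGB image, so one banana model can cover many instance shapes.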
While PLANTPose was designed with agricultural products in mind, the same approach can be applied to industrial settings, where tools and components often vary slightly in shape or size. In these cases, reliable pose estimation is essential for handling workshop tools, such as screwdrivers or pliers, with precision. Beyond this, the industrial scenario also requires robots to understand the workspace at a higher level. For this purpose, we developed image captioning, visual question answering, and scene-graph extraction modules. These achieved very promising results, enabling robots to describe their surroundings, answer task-related questions, and map the relations between objects, for example recognizing that a hammer is inside a toolbox or that two tools lie side by side on a bench. While these capabilities were tailored to workshop environments, they can also support agricultural tasks, such as describing the contents of a packaging station or mapping the arrangement of produce on a conveyor.
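A scene graph of the kind described above is essentially a set of subject-predicate-object triples that downstream task logic can query. The snippet below is a hypothetical sketch of that data structure (the relation names and example scene are illustrative, not the module's actual output format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """One edge of a scene graph: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str

def query_scene_graph(relations, predicate):
    """Return all (subject, object) pairs linked by a given spatial predicate."""
    return [(r.subject, r.obj) for r in relations if r.predicate == predicate]

# hypothetical output of scene-graph extraction for a workbench image
scene = [
    Relation("hammer", "inside", "toolbox"),
    Relation("screwdriver", "next_to", "pliers"),
    Relation("pliers", "on", "bench"),
]
```

Querying `scene` for the `"inside"` predicate, for instance, lets a manipulation planner know it must open the toolbox before grasping the hammer.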
In contrast, the edutainment scenario places the focus on people rather than objects. Here, robots must perceive human activity in a classroom-like setting. We developed core perception modules, including human pose estimation, face detection, gaze direction and target estimation, and object detection. Together, these allow the robot to recognize whether a student is working at a laptop, interacting with the Robobo robot, or receiving help from a teacher. To make interaction more natural, we also introduced a hand gesture recognition module, enabling the robot to interpret non-verbal cues. These modules achieved encouraging results, supporting the development of intelligent tutoring robots that can adapt to classroom dynamics and provide assistance when needed.
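To show how the outputs of these modules can be fused into an activity label, the sketch below uses simple illustrative rules; the function name, inputs, and logic are assumptions for exposition, not the project's actual recognition model:

```python
def classify_activity(gaze_target, hand_near, detected_objects):
    """Hypothetical rule-based fusion of gaze, hand, and object-detection
    outputs into a classroom activity label (illustrative only)."""
    if gaze_target == "laptop" and "laptop" in detected_objects:
        return "working_at_laptop"
    if gaze_target == "robobo" or hand_near == "robobo":
        return "interacting_with_robobo"
    if gaze_target == "teacher":
        return "receiving_help"
    return "unknown"
```

In the deployed system, a learned model would replace these hand-written rules, but the principle is the same: per-frame perception outputs (gaze target, hand proximity, detected objects) are combined into a higher-level activity the tutoring robot can act on.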
Across these three use cases, the perception modules developed by the ARC team form a versatile and robust foundation for the PILLAR-Robots project. By enabling robots to understand both objects and human activities, we are getting closer to the project’s vision of purposeful, adaptive robots that can operate effectively across diverse domains.
[1] M. Glytsos, P. P. Filntisis, G. Retsinas and P. Maragos, “Category-Level 6D Object Pose Estimation in Agricultural Settings Using a Lattice-Deformation Framework and Diffusion-Augmented Synthetic Data”, Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025.
