PILLAR-Robots: Visual perception for enabling purpose grounding

In the PILLAR-Robots project, the ARC team is focused on developing advanced multimodal perception modules that equip robots with the essential tools for interpreting their environments and acting purposefully. These perception modules are crucial to unlocking the full potential of open-ended learning, as they enable robots to ground their actions in specific, purposeful tasks.

The backbone we are building includes a comprehensive suite of modules that serve two primary functions. First, they provide robots with essential perception tools for a basic understanding of their surrounding scenes (e.g., object detection, scene segmentation) and for extracting information from human interactions (e.g., action recognition, speech recognition, gesture recognition). Second, these perception tools help robots ground their tasks within the environment through integration with the other components of the PILLAR project.

The modules fall into two main categories: Scene Understanding and Human Understanding. Scene Understanding enables robots to recognize objects, segment scenes into components, understand spatial arrangements, and capture object relationships, all of which are critical for grounding purpose within a visual context. For example, to interpret an image of a book on a shelf, the robot must detect both the book and the shelf and understand their spatial relationship, as sketched below. Human Understanding, on the other hand, equips robots with the capabilities to decode human actions and commands. Perception modules here, such as speech recognition, gesture recognition, and affect recognition, will allow robots to ground purpose through human interaction, whether through visual examples (like a person pointing to an apple) or natural language commands (such as verbal instructions).
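To make the "book on a shelf" example concrete, the following is a minimal sketch of how a spatial relation could be checked over the output of an object detector. The `Detection` class, the box values, and the `is_on_top_of` heuristic are illustrative assumptions, not the actual PILLAR perception modules.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """A detected object: class label plus an axis-aligned box (x1, y1, x2, y2) in pixels."""
    label: str
    box: tuple  # (x1, y1, x2, y2), origin at the top-left of the image

def is_on_top_of(upper: Detection, lower: Detection, tolerance: int = 15) -> bool:
    """Rough 'A is on B' test: A's bottom edge sits near B's top edge and the boxes overlap horizontally."""
    ux1, _, ux2, uy2 = upper.box
    lx1, ly1, lx2, _ = lower.box
    vertically_adjacent = abs(uy2 - ly1) <= tolerance
    horizontally_overlapping = min(ux2, lx2) > max(ux1, lx1)
    return vertically_adjacent and horizontally_overlapping

# Hypothetical detector output for the "book on a shelf" scene.
book = Detection("book", (120, 80, 200, 140))
shelf = Detection("shelf", (40, 150, 400, 170))
if is_on_top_of(book, shelf):
    print("Relation grounded: the book is on the shelf.")
```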

Building upon these modules, the PILLAR project will enable robots to achieve purpose-driven grounding through integration with other elements. For example, visual examples can guide a robot to understand that placing a book on a shelf fulfills a particular purpose. Additionally, spoken or text commands can direct the robot’s actions, requiring it to integrate multiple perceptual cues for interpreting and executing tasks based on human input.
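As a simple illustration of cue integration, the sketch below matches the words of a transcribed command against the labels produced by the scene-understanding modules, so a downstream component could judge whether the referenced objects are actually visible. The function, inputs, and labels are hypothetical and stand in for the project's real integration pipeline.

```python
def ground_command(command: str, detected_labels: set[str]) -> set[str]:
    """Return the objects mentioned in the command that the perception modules can currently see."""
    tokens = {word.strip(".,!?").lower() for word in command.split()}
    return tokens & detected_labels

# Hypothetical inputs: a transcribed instruction plus labels from the scene-understanding modules.
visible = {"book", "shelf", "table", "apple"}
grounded = ground_command("Put the book on the shelf", visible)
print(grounded)  # {'book', 'shelf'} -> both referents are visible, so the task can be attempted
```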

Each perception module is designed with real-world applications in mind, particularly for agri-food, edutainment, and industrial workshops. For the agri-food domain, our modules are tailored to recognize specific fruits and vegetables. In edutainment, modules focused on human understanding are prioritized, allowing robots to interpret affect, engagement, and feedback from the audience. In industrial settings, modules like 6DoF pose estimation are integrated to facilitate tasks such as object manipulation and workspace navigation.
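For the industrial case, a 6DoF pose estimate is typically handed to a manipulation planner as a homogeneous transform. The snippet below shows one common way to package such an estimate; the pose values are invented and the conversion is a generic sketch, not the interface of the PILLAR pose-estimation module.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(translation, quaternion_xyzw):
    """Convert a 6DoF pose (translation + orientation quaternion, scalar-last) into a
    4x4 homogeneous transform in the camera frame, the form a grasp planner typically consumes."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat(quaternion_xyzw).as_matrix()
    T[:3, 3] = translation
    return T

# Hypothetical output of a 6DoF pose-estimation module for a workpiece (metres, xyzw quaternion).
camera_T_object = pose_to_matrix([0.42, -0.05, 0.60], [0.0, 0.0, 0.7071, 0.7071])
print(np.round(camera_T_object, 3))
```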

Overall, the perception modules we are building will form the backbone that enables robots to operate effectively in diverse environments and to continuously learn and adapt from their experiences in a purposeful manner. As we progress, these modules are being validated in simulated and real-world scenarios, ensuring that the developed perception systems are robust, adaptable, and effective across various domains.
