If we want a robot to truly learn on its own – what we call open-ended learning – what kind of basic “senses” or perception abilities does it actually need? And looking at the three main environments the PILLAR project focuses on (Agriculture, Edutainment, and Industrial), what specific systems has your team developed to give the robots these abilities?
To solve this, we essentially give the robot a dynamic “visual memory bank” that grows as it explores. First, to stop the robot from getting confused by its own anatomy, we actually pre-load this memory with images of the robot’s own arms and tag them as things to ignore. This simple step ensures the robot doesn’t waste time trying to “discover” its own body and focuses its attention entirely on the room.
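The idea of a memory bank pre-loaded with "ignore" entries can be sketched in a few lines. This is a toy illustration, not the project's actual implementation: the class name, the cosine-similarity matching, and the threshold are all illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class VisualMemoryBank:
    """Grows as the robot explores; entries tagged 'ignore' (e.g. images
    of the robot's own arms) never count as novel discoveries."""

    def __init__(self, match_threshold=0.9):
        self.entries = []  # list of (embedding, tag) pairs
        self.match_threshold = match_threshold

    def add(self, embedding, tag="object"):
        self.entries.append((np.asarray(embedding, dtype=float), tag))

    def query(self, embedding):
        # Return the tag of the closest stored entry, or 'novel'
        # if nothing in memory is similar enough.
        best_tag, best_sim = "novel", self.match_threshold
        for stored, tag in self.entries:
            sim = cosine(stored, embedding)
            if sim >= best_sim:
                best_tag, best_sim = tag, sim
        return best_tag

# Pre-load the bank with an embedding of the robot's own arm,
# tagged so it is filtered out of exploration.
bank = VisualMemoryBank()
bank.add(np.array([1.0, 0.0, 0.0]), tag="ignore")

print(bank.query(np.array([0.99, 0.05, 0.0])))  # close to the arm: "ignore"
print(bank.query(np.array([0.0, 1.0, 0.0])))    # unseen object: "novel"
```

Anything the robot sees that matches an "ignore" entry is simply skipped, so curiosity is spent only on the rest of the room.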
Then, to keep it from getting overwhelmed by a sea of new objects, we don’t just let it rely on pure, random curiosity. We can gently steer the robot by giving it simple clues – like a picture of a specific tool or even a short text prompt. The clue nudges the robot’s attention toward things that are actually relevant to its task and highlights the areas of the scene most relevant to the concept, drastically cutting down the time it spends interacting with useless clutter. For a messy room, it lights up the scattered objects on the floor; for human feelings, it highlights faces and hands. This lets the robot instinctively focus its attention on the right areas without us having to hard-code a rule for every single situation.
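In spirit, the steering step scores each region of the scene by how well it matches the clue's embedding. The sketch below is a simplified assumption about how such a relevance map could be computed; the real system's encoders and scoring are more sophisticated.

```python
import numpy as np

def softmax(x):
    # Normalize scores so the attention budget sums to one.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relevance_map(region_embeddings, prompt_embedding):
    """Score each scene region by cosine similarity to the prompt
    (an embedded image of a tool, or an encoded text clue)."""
    p = prompt_embedding / np.linalg.norm(prompt_embedding)
    scores = []
    for r in region_embeddings:
        r = r / np.linalg.norm(r)
        scores.append(float(np.dot(r, p)))
    return softmax(np.array(scores))

# Three toy regions: background clutter, a scattered toy, and the
# prompted tool (embeddings are made up for illustration).
regions = [np.array([0.1, 0.9]),
           np.array([0.5, 0.5]),
           np.array([0.95, 0.1])]
prompt = np.array([1.0, 0.0])  # embedding of the clue we gave the robot

attention = relevance_map(regions, prompt)
print(attention.argmax())  # 2: attention concentrates on the prompted tool
```

The same mechanism covers both examples from the text: swap in a "scattered objects" prompt and the floor regions score highest; swap in an emotion-related prompt and faces and hands do.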
Agricultural environments present a unique challenge because “produce” is never uniform; every pear or bell pepper has a slightly different 3D shape and deformation. How does your work in category-level pose estimation allow a robot to successfully harvest these “non-rigid” objects without needing a specific 3D CAD model for every individual fruit in the field?
You’ve hit on exactly why agricultural robotics is so difficult! Instead of trying to give the robot a perfect blueprint for every individual fruit – which is impossible – we teach it the general concept of the fruit. We give the robot a single, generic “template” for a category, like a basic 3D pear shape. Then, our system looks at a standard camera image and calculates how to dynamically warp, stretch, and bend that generic template to perfectly match the unique, real-world piece of fruit hanging on the branch.
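At its simplest, "warping the template" means adding a predicted per-vertex offset field to the canonical category shape. This is a minimal sketch under that assumption; in practice the deformation would come from a learned network rather than being written out by hand.

```python
import numpy as np

def deform_template(template_vertices, offsets):
    """Warp a generic category template (e.g. a canonical 3D pear mesh)
    by per-vertex offsets predicted from the camera image, yielding the
    shape of this particular fruit."""
    return template_vertices + offsets

# A toy 4-vertex "pear" template in its canonical frame.
template = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

# Hand-written offsets standing in for the network's prediction:
# this fruit bulges on one side and is slightly squashed on top.
offsets = np.array([[0.0,  0.0,  0.0],
                    [0.2,  0.0,  0.0],
                    [0.0, -0.1,  0.0],
                    [0.0,  0.0,  0.05]])

fruit = deform_template(template, offsets)
```

Because only the offsets change from fruit to fruit, one template per category replaces the impossible requirement of a CAD model per individual fruit.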
The real breakthrough here is that we can estimate both the exact position (the pose) and the unique shape (the deformation) using just standard color cameras, rather than relying on expensive and finicky depth sensors that often struggle in dense foliage. To pull this off, we actually train the robot entirely in highly realistic, AI-enhanced simulations. By the time it is placed in a messy, unpredictable orchard, it already understands how a basic fruit can deform and can adapt its “template” to the real thing in milliseconds.
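Estimating both quantities at once can be pictured as composing two transforms: a non-rigid deformation of the canonical template (the shape) followed by a rigid rotation and translation into the camera frame (the pose). The sketch below assumes that decomposition for illustration; it is not the project's actual estimation pipeline, which infers these values from RGB images.

```python
import numpy as np

def rotation_z(theta):
    # Rotation matrix about the camera z-axis by angle theta (radians).
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def place_in_scene(canonical_vertices, deformation, R, t):
    """Combine the two estimates: the unique shape (per-vertex
    deformation of the template) and the exact 6-DoF pose
    (rotation R, translation t) in the camera frame."""
    shaped = canonical_vertices + deformation  # non-rigid: this fruit's shape
    return shaped @ R.T + t                    # rigid: where it hangs

# One toy vertex of the template, with made-up estimates.
template = np.array([[1.0, 0.0, 0.0]])
deformation = np.array([[0.1, 0.0, 0.0]])  # this fruit is slightly longer
R = rotation_z(np.pi / 2)                  # hanging rotated 90 degrees
t = np.array([0.0, 0.0, 2.0])              # two metres in front of the camera

print(place_in_scene(template, deformation, R, t))  # [[0.0, 1.1, 2.0]]
```

The gripper then only needs the output of this composition: where the fruit is, which way it faces, and how its surface bulges.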
