Our Approach

Based on our observations, we propose a taxonomy of the components required to describe the physical motions that fulfil an affordance. An affordance can comprise one or more of these components, which may be fulfilled simultaneously or sequentially; a small sketch of this composition follows below.
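To make the composition explicit, the sketch below models an affordance as an ordered list of component groups: components within a group are fulfilled simultaneously, and the groups themselves are fulfilled sequentially. The component names are hypothetical placeholders for illustration, not the actual classes of our taxonomy.

```python
from dataclasses import dataclass

@dataclass
class Component:
    """One motion component of an affordance (name is a placeholder)."""
    name: str

@dataclass
class Affordance:
    """An affordance as sequential stages of simultaneous components."""
    name: str
    stages: list[list[Component]]  # outer list: sequential; inner list: simultaneous

# Hypothetical example: covering a pot = grasp the lid, then move and align it.
cover_pot = Affordance(
    name="cover pot",
    stages=[
        [Component("grasp")],                          # stage 1
        [Component("translate"), Component("align")],  # stage 2, fulfilled together
    ],
)
```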

While not all classes of components could be included in the experiments, we evaluated the taxonomy against a list of over a hundred affordances, testing whether each could be described within it. We found no exceptions, so we consider the taxonomy sufficiently complete for future use, though further verification is recommended.

Because the number of affordances is effectively unlimited, defining and training every affordance individually would be infeasible. Instead, we kept the reward functions as general as possible, using only the successful end-state of the affordance (e.g. the pot is covered, or the card is inside the box) to simulate a generic approach to affordance learning.
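As an illustration, a sparse reward of this kind only evaluates an end-state predicate, e.g. whether the lid rests on the pot. The predicate and tolerances below are hypothetical, not our exact implementation:

```python
import numpy as np

def end_state_reward(lid_pos, pot_pos, pot_height, xy_tol=0.02, z_tol=0.01):
    """Sparse reward: 1.0 once the pot is covered, 0.0 otherwise.

    'Covered' is approximated here as the lid centre sitting above the
    pot centre at rim height, within small tolerances; the real success
    check for a given affordance may differ.
    """
    xy_dist = np.linalg.norm(np.asarray(lid_pos[:2]) - np.asarray(pot_pos[:2]))
    z_ok = abs(lid_pos[2] - (pot_pos[2] + pot_height)) < z_tol
    return float(xy_dist < xy_tol and z_ok)
```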

A few further affordances were considered for the experiments, but most were dropped due to limited resources and time constraints on implementation. These include: Stirring Objects in a Pot, Cutting a String, Cutting a Fruit, and Wrapping a Chain around a Horizontal Pole. The chain affordance proved particularly troublesome, as the handling of the chain by the physics engine turned out to be unreliable in both MuJoCo and PyBullet. While the chain on its own could be handled reasonably well, as soon as the agent gained control over any part of it, the behaviour became unpredictable and the joint connections appeared to lose effect.

Comparing the two physics engines, MuJoCo exhibited frequent tunnelling between objects (e.g. the toast would simply drop through the pan). Using internal primitives instead of externally loaded meshes resolved these issues, although in many cases at the cost of precision. Furthermore, even small changes to mass had an unnaturally strong effect on the interaction between objects. Tunnelling also occurred in PyBullet, but only at rather large simulation step sizes, which is expected.
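The step-size dependence can be reproduced directly. The PyBullet snippet below, with illustrative values, shrinks the fixed time step, which in our experience is the main lever against tunnelling of thin or fast objects:

```python
import pybullet as p

p.connect(p.DIRECT)

# Smaller fixed time steps reduce tunnelling of fast or thin bodies;
# PyBullet's default is 1/240 s, and larger steps invite tunnelling.
p.setTimeStep(1.0 / 240.0)

# More solver iterations can further stabilise contacts (value illustrative).
p.setPhysicsEngineParameter(numSolverIterations=50)
```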

All in all, we found PyBullet to be considerably more stable and predictable, as well as more accessible thanks to its more extensive documentation. Unfortunately, this stability came at the cost of speed: training took about two to three times as long with PyBullet as with MuJoCo. As a result, PyBullet did not make it into the final training of the agents due to time constraints, and all results were therefore obtained with MuJoCo.

More than thirty 3D models were designed to perform the different affordances in both virtual reality and motion tracking. The objects were modelled after the measurements of their real-world counterparts and created with the open-source software Blender.
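For completeness, a minimal sketch of the kind of batch export this entails, using the Blender 2.8/3.x Python API; the output path is illustrative:

```python
import bpy

# Export every mesh object in the scene as a separate OBJ file, so the
# models can be loaded by the physics engines and by Unreal Engine.
for obj in bpy.context.scene.objects:
    if obj.type != 'MESH':
        continue
    bpy.ops.object.select_all(action='DESELECT')
    obj.select_set(True)
    bpy.ops.export_scene.obj(
        filepath=f"/tmp/models/{obj.name}.obj",  # illustrative path
        use_selection=True,
    )
```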

The virtual reality setup is rather simple, as it mainly involves recreating the motion-tracking scene in the Unreal Engine. We used an HTC Vive Pro Eye headset with its native controllers. To ensure that the virtual reality scenes and the OptiTrack scenes correspond properly, the scene's dimensions were measured beforehand and applied to the virtual scene. One major issue with the objects involved is that most of them are non-convex, a property that is quite common among household objects.

Unfortunately, the PhysX engine, just like MuJoCo and PyBullet, can only handle convex collision shapes. Since a single convex hull would not be accurate enough for many of our objects, convex decomposition was necessary. At first we tried the V-HACD algorithm, but it generated too many colliders, resulting in poor performance. As a solution, we decomposed the objects manually in Blender, which produced more uniform colliders and improved not only the performance but also the collision detection.
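For reference, a V-HACD pass of the kind we first tried can be run through PyBullet's bundled binding; the file names and parameter values below are illustrative, and in our case no tuning yielded colliders as uniform as the manual decomposition:

```python
import pybullet as p

p.connect(p.DIRECT)

# Approximate convex decomposition of a non-convex household object.
# Lower `resolution` and `maxNumVerticesPerCH` yield fewer, coarser hulls.
p.vhacd(
    "pot.obj",         # illustrative input mesh
    "pot_vhacd.obj",   # output: concatenation of convex hulls
    "vhacd_log.txt",
    resolution=100000,
    maxNumVerticesPerCH=64,
)
```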

Finally, it had to be ensured that the starting positions of the involved objects were consistent with the OptiTrack setup. This was achieved by marking key positions in the OptiTrack setup with tape, so that they could easily be recreated. Only basic interaction for the virtual reality headset and controllers needs to be implemented and accurately calibrated; as long as these steps are followed, the two environments are comparable.

OptiTrack Motive is motion-capture software developed by OptiTrack for capturing and analysing the movement of objects or people in 3D space. Motive uses advanced algorithms and image-processing techniques to track and record the movement of reflective markers placed on the subject or object being tracked. Combined with Unreal Engine, it offers a way to create virtual environments that simulate real-life scenarios with high precision.

For the experiments, we used OptiTrack Motive 2.0.1, operated with a combination of 13 PrimeX 13 and 4 PrimeX 41 cameras. The experiments were conducted on a computer system comprising an Intel i7-4790 processor with 4 cores clocked at 3.6 GHz, 32 GB of 3600 MHz DDR4 RAM, and an Nvidia Titan V 12 GB graphics card. The system was integrated with Unreal Engine 4.27.

The initial step in utilising the OptiTrack system involves the precise setup of the tracking area by positioning the cameras in appropriate locations and calibrating them to ensure accurate tracking. This calibration process is critical to achieving reliable and reproducible tracking results.

The affordances were selected based on properties that we estimated might pose challenges for the reinforcement learning, the physics engine, or the execution in virtual environments. Some of the selected affordances have properties that are challenging for reinforcement learning and were only tested in that context, not in the virtual Unreal environment.
Here is the list of affordances we used: