Based on our observations, we propose a taxonomy of components required to describe the physical motions that fulfil an affordance. An affordance can comprise one or more of these components, which may be fulfilled simultaneously or sequentially.
While not all classes of components could be included in the experiments, we evaluated the taxonomy by referencing a list of over one hundred affordances and testing whether each of them could be described using the taxonomy. We did not find any exceptions; we therefore consider the taxonomy sufficiently complete for future use, though further verification is recommended.
Due to the practically limitless number of affordances, it would be infeasible to define and train every affordance individually.
Instead, we kept the reward functions as general as possible, using only the successful end state of the affordance (e.g. the pot is covered or the card is inside the box) to simulate a generic approach to affordance learning.
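As an illustration, the following is a minimal sketch of such an end-state reward for a pot-covering task, assuming a hypothetical environment that exposes the lid and pot positions; the function name, the assumed rim height, and the tolerances are illustrative placeholders rather than values from our implementation.

```python
import numpy as np

# Minimal sketch of a generic end-state reward for a "Covering a Pot"
# affordance. The assumed rim height of 0.1 m and the tolerances are
# hypothetical placeholders, not values from our setup.
def covered_pot_reward(lid_pos: np.ndarray, pot_pos: np.ndarray,
                       rim_height: float = 0.1,
                       xy_tol: float = 0.02, z_tol: float = 0.01) -> float:
    """Return 1.0 once the lid rests centred on top of the pot, else 0.0."""
    xy_offset = np.linalg.norm(lid_pos[:2] - pot_pos[:2])
    z_offset = lid_pos[2] - pot_pos[2]  # lid should sit at the rim height
    if xy_offset < xy_tol and abs(z_offset - rim_height) < z_tol:
        return 1.0
    return 0.0
```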
A few further affordances were considered for the experiments; however, most were dropped due to limited resources and time constraints on implementation. These affordances include: Stirring Objects in a Pot, Cutting a String, Cutting a Fruit, and Wrapping a Chain around a Horizontal Pole.
The chain affordance proved particularly troublesome, as the handling of the chain by the physics engine turned out to be unreliable in both MuJoCo and PyBullet. While the chain on its own could be simulated adequately, as soon as the agent gained control over any part of it, the behaviour became unpredictable and the joint connections appeared to be ineffective.
Comparing the two physics engines, MuJoCo exhibited frequent tunnelling between objects (e.g. the toast would simply drop through the pan). Using MuJoCo's internal primitives instead of externally imported objects resolved these issues, although in many cases at the cost of precision.
Furthermore, even small changes to mass seemed to have an unnaturally strong effect on the interaction between objects. Tunnelling could also be observed in PyBullet, but only at rather large simulation step sizes, which is expected.
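For reference, the relevant settings in PyBullet are the simulation time step and the number of solver sub-steps; the sketch below shows how tunnelling can be suppressed by reducing the former and increasing the latter, with illustrative values.

```python
import pybullet as p

p.connect(p.DIRECT)                    # headless physics server
p.setGravity(0, 0, -9.81)

# Smaller time steps and more sub-steps reduce tunnelling for thin or
# fast-moving objects, at the cost of simulation speed. The values below
# are illustrative; 1/240 s is PyBullet's default time step.
p.setTimeStep(1.0 / 240.0)
p.setPhysicsEngineParameter(numSubSteps=4)

for _ in range(240):                   # advance one simulated second
    p.stepSimulation()

p.disconnect()
```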
All in all, we found PyBullet to be considerably more stable and predictable, as well as more accessible thanks to its more extensive documentation. Unfortunately, this stability came at the cost of speed: training took roughly two to three times as long with PyBullet as with MuJoCo. Due to time limitations, PyBullet therefore did not make it into the final training of the agents, and all results were obtained using MuJoCo.
More than thirty 3D models were designed for performing the different affordances in virtual reality as well as in motion tracking. The objects were modelled after the measurements of their real-world counterparts and created with the open-source software Blender.
The virtual reality setup is rather simple, as it mainly involves recreating the motion-tracking scene in Unreal Engine. For the experiments, we used an HTC Vive Pro Eye VR headset with its native controllers.
To ensure that the virtual reality scenes and the OptiTrack scenes properly correspond to each other, the physical scene's dimensions were measured beforehand and applied to the virtual scene.
One major issue with the involved objects is that most of them were non-convex, a property that is quite common among household objects. Unfortunately, the PhysX engine, just like MuJoCo and PyBullet, can only handle convex collision shapes. Since a single convex hull would not be accurate enough for many of our objects, convex decomposition was necessary.
At first, we tried the V-HACD algorithm, which generated too many colliders and thus degraded performance. As a solution, we decomposed the objects manually in Blender, which resulted in more uniform colliders and improved not only the performance but also the collision detection.
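For completeness, V-HACD can be invoked through PyBullet's built-in binding; the file names and parameter values below are illustrative. A higher `resolution` produces more accurate but also more numerous colliders, which is precisely the trade-off that degraded performance in our case.

```python
import pybullet as p

p.connect(p.DIRECT)

# Decompose a non-convex mesh into convex parts with PyBullet's V-HACD
# binding. "pot.obj" is a hypothetical input file; the decomposition is
# written to "pot_vhacd.obj" and a log to "vhacd_log.txt".
p.vhacd("pot.obj",
        "pot_vhacd.obj",
        "vhacd_log.txt",
        resolution=100000,          # voxel resolution: accuracy vs. collider count
        maxNumVerticesPerCH=64)     # cap vertices per convex hull

p.disconnect()
```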
Finally, it had to be ensured that the starting positions of the involved objects were consistent with the OptiTrack setup; this was achieved by marking key positions in the OptiTrack setup with tape so that they could easily be recreated.
Only basic interaction for the virtual reality headset and controllers needed to be implemented and accurately calibrated; as long as these steps are followed, the two environments are comparable.
OptiTrack Motive is motion-capture software developed by OptiTrack, used for capturing and analysing the movement of objects or people in 3D space.
Motive uses advanced algorithms and image processing techniques to track and record the movement of
reflective markers placed on the subject or object being tracked.
Combined with Unreal Engine, it allows the creation of virtual environments that simulate real-life scenarios with high precision.
For the experiments, we used OptiTrack Motive 2.0.1, operated with a combination of 13 Primex 13 and 4 Primex 41 cameras.
The experiments were conducted on a computer system comprising an Intel i7-4790 processor with 4 cores clocked at 3.6 GHz, 32 GB of 3600 MHz DDR4 RAM, and an Nvidia Titan V 12 GB graphics card.
The system was integrated with Unreal Engine 4.27.
The initial step in utilising the OptiTrack system involves the precise setup of the tracking area by
positioning the cameras in appropriate locations and calibrating them to ensure accurate tracking.
This calibration process is critical to achieving reliable and reproducible tracking results.
The affordances were selected based on properties that we estimated might pose challenges for the reinforcement learning, the physics engine, or the execution in virtual environments. Some of the selected affordances have properties that are challenging for reinforcement learning and were only tested in that context, not in the virtual Unreal environment.
The following affordances were used:
- Covering a Pot
- Flipping an Object in a Pan
- Pouring Liquids
- Inserting a Card into a Mailbox
- Balancing Objects on a Tray
- Puzzle Assembly