Influence of Different Hand Models on Rendered Training Data for Application in Pose Estimation of Objects

This bachelor thesis explores the influence of different hand models on synthetic training data for 6D pose estimation of objects in hand-object interactions. To generate comparable image datasets, a pipeline for rendering hand-object interactions in diverse environments was developed.

Description

To isolate the differences incurred by the hand representations, a new set of datasets had to be created, as no such datasets were available in the literature. Simulating realistic hands is no trivial task, given the complexity of human hands and the difficulty of simulating organic matter. The MANO and NIMBLE hand models provide joint-angle parameters to generate realistic hand meshes in arbitrary grasping poses. For this thesis, an application was developed that assists in generating adequate grasping poses for a set of objects. The hand-object interactions generated this way are then fed into a parameterized render pipeline to produce datasets of various grasped objects in diversely textured and lit environments.
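Such a parameterized pipeline can be pictured as sampling a configuration record per rendered image. The sketch below is illustrative only; the field names, value ranges, and sampling strategy are assumptions, not the thesis's actual interface:

```python
import random
from dataclasses import dataclass

@dataclass
class RenderConfig:
    object_name: str          # which grasped object to render
    hand_model: str           # "MANO" or "NIMBLE" (hypothetical labels)
    grasp_pose_id: int        # index into the pre-generated grasp poses
    environment_texture: str  # background/world texture
    light_intensity: float    # scene lighting strength
    camera_distance: float    # camera distance from the object

def sample_config(objects, textures, n_poses, rng):
    """Randomly sample one render configuration to diversify the dataset."""
    return RenderConfig(
        object_name=rng.choice(objects),
        hand_model=rng.choice(["MANO", "NIMBLE"]),
        grasp_pose_id=rng.randrange(n_poses),
        environment_texture=rng.choice(textures),
        light_intensity=rng.uniform(0.5, 2.0),   # assumed range
        camera_distance=rng.uniform(0.4, 1.2),   # assumed range, meters
    )
```

Drawing one such record per image and rendering accordingly yields datasets that vary texture, lighting, and viewpoint while keeping the hand representation as the controlled variable.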

Three core datasets were generated using this method.

These datasets were used to train multiple instances of a ZebraPose network, which were then evaluated on the DexYCB dataset, which consists of real, annotated images of grasped objects.

Results

The networks trained on the new datasets produced accurate pose estimates on the test dataset, as shown in the collection of images above.

To quantify the correctness of estimated poses, the results are typically evaluated with the ADD metric and its symmetric variant ADD-S. Both describe the average distance between corresponding object vertices under the predicted pose and the ground-truth pose; for symmetric objects, ADD-S matches each vertex to its closest counterpart instead. An average distance below 10% of the object diameter is considered a correct pose. The Area Under the Curve (AUC) summarizes the metric's behavior over a range of thresholds.
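The metrics above can be sketched in a few lines of numpy. This is a minimal illustration of the standard definitions, assuming object vertices as an (n, 3) array and poses given as a rotation matrix R and translation t; it is not the evaluation code used in the thesis:

```python
import numpy as np

def add_metric(verts, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding vertices transformed
    by the predicted and the ground-truth pose."""
    pred = verts @ R_pred.T + t_pred
    gt = verts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def adds_metric(verts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: for symmetric objects, each ground-truth vertex is matched
    to the closest predicted vertex before averaging."""
    pred = verts @ R_pred.T + t_pred
    gt = verts @ R_gt.T + t_gt
    # pairwise distance matrix (n_gt x n_pred), minimum over predictions
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return d.min(axis=1).mean()

def add_auc(errors, max_threshold):
    """AUC: area under the accuracy-vs-threshold curve, where accuracy at
    threshold tau is the fraction of errors below tau; normalized to [0, 1]."""
    thresholds = np.linspace(0.0, max_threshold, 100)
    accuracy = np.array([(errors < t).mean() for t in thresholds])
    return np.trapz(accuracy, thresholds) / max_threshold
```

Since ADD-S takes the closest rather than the corresponding vertex, it is never larger than ADD for the same pose pair, which is why it is the appropriate choice for objects with rotational symmetries.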

The following tables additionally contain the results of a network trained on ungrasped objects (O).
The metrics show no significant difference between the MANO and the NIMBLE datasets. In contrast, training on the dataset with an artificial forearm clearly benefited symmetric pose accuracy.

The AUC also shows the performance gained by including part of the forearm during training. It additionally reveals a slight advantage of the NIMBLE dataset over the MANO dataset around the one-percent mark.

Additionally, another network instance was trained on a dataset of the same size consisting of 50% MANO-grasped and 50% ungrasped objects. Interestingly, this combination led to a clear, object-independent improvement over using only the full MANO dataset.

Files

Full version of the bachelor thesis (German only)

License

This original work is copyright by University of Bremen.
Any software of this work is covered by the European Union Public Licence v1.2. To view a copy of this license, visit eur-lex.europa.eu.
The thesis provided above (as a PDF file) is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
Any other assets (3D models, movies, documents, etc.) are covered by the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit creativecommons.org.
If you use any of the assets or software to produce a publication, then you must give credit and put a reference in your publication.
If you would like to use our software in proprietary software, you can obtain an exception from the above license (i.e., dual licensing). Please contact zach at cs.uni-bremen dot de.