Inverse Reinforcement Learning and Affordances
Reinforcement learning requires a reward function to learn an appropriate policy, and defining one can be challenging depending on the complexity of the task. This master's thesis investigates how well imitation learning algorithms can mimic behavior given expert demonstrations within a virtual environment. It also incorporates techniques such as task rewards and hyperparameter optimization to help overcome limitations common to these approaches. The final policies could not execute the tasks reliably, but hyperparameter optimization showed promising results.
Description
The field of imitation learning comprises several categories; for this thesis, only inverse reinforcement learning (IRL) and adversarial imitation learning (AIL) are relevant. IRL tries to infer a reward function from expert demonstrations, but most algorithms in this field operate in discrete spaces, which is not feasible for affordance tasks. More important is AIL, which uses a two-player structure similar to IRL but realizes it through generative adversarial networks. The algorithms in this category work in continuous spaces and are available in different packages, such as "imitation" from the Center for Human-Compatible AI. Nevertheless, the standard approaches can struggle to generalize over multi-stage tasks. To counter this problem, a task reward is introduced: a sparse reward function that highlights the completion of individual stages. In addition, a poor choice of hyperparameters can lead to the complete failure of an algorithm or, at least, to a loss in performance. To address this, hyperparameter optimization, realized here through Optuna, finds a good set of hyperparameters automatically.
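As an illustration, the sketch below shows how such a sparse, stage-based task reward could be blended with the learned adversarial reward. The stage checks, the mixing weight, and the function names are assumptions made for this example and do not reproduce the thesis' exact implementation.

```python
def task_reward(state: dict) -> float:
    """Hypothetical sparse task reward: pays out once per completed stage."""
    reward = 0.0
    if state.get("object_grasped"):   # stage 1: the hand holds the object
        reward += 1.0
    if state.get("object_at_goal"):   # stage 2: the object reached its target pose
        reward += 1.0
    return reward


def combined_reward(learned_reward: float, state: dict, task_weight: float = 0.5) -> float:
    """Blend the discriminator-based reward with the sparse task reward.

    `task_weight` controls how strongly the stage signal influences training;
    the value here is only a placeholder, not the weighting used in the thesis.
    """
    return (1.0 - task_weight) * learned_reward + task_weight * task_reward(state)
```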
The final implementation utilizes Unreal Engine both as a training environment for the learning algorithms and as a VR environment for trajectory recording. In addition, the plugin USemLog tracks objects, which makes it easy to create trajectories from this information. The implementation supports continuous and discrete state and action spaces and allows multiple agents to run in parallel. However, most learning algorithms are written in Python, which Unreal Engine does not natively support at runtime, so the learning component has to run as a separate process. Unreal Engine and this process communicate through TCP sockets. The learning side utilizes AIRL and GAIL, two well-known AIL algorithms from the imitation package. Their main difference is that AIRL structures the discriminator so that it explicitly learns a reward function, while GAIL uses a plain classifier. In addition, the implementation includes packages such as Gymnasium, Sacred, and Omniboard to manage and view the algorithms' results.
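The sketch below outlines how such an environment could be exposed to the Python side as a Gymnasium environment, with reset and step exchanging JSON messages over a TCP socket. The class name, message format, port, and observation/action dimensions are placeholders for illustration; the actual protocol used in the thesis may differ.

```python
import json
import socket

import numpy as np
import gymnasium as gym
from gymnasium import spaces


class UnrealTcpEnv(gym.Env):
    """Sketch of a Gymnasium environment backed by an Unreal Engine instance.

    The agent is a free-moving hand; observation and action dimensions as well
    as the wire protocol are placeholders, not the thesis' actual interface.
    """

    def __init__(self, host: str = "127.0.0.1", port: int = 9000):
        super().__init__()
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(16,), dtype=np.float32)
        # 3D hand movement plus a grab signal.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self._sock = socket.create_connection((host, port))
        self._stream = self._sock.makefile("rw")

    def _request(self, message: dict) -> dict:
        """Send one JSON line to Unreal Engine and read one JSON line back."""
        self._stream.write(json.dumps(message) + "\n")
        self._stream.flush()
        return json.loads(self._stream.readline())

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        reply = self._request({"command": "reset"})
        return np.asarray(reply["observation"], dtype=np.float32), {}

    def step(self, action):
        reply = self._request({"command": "step", "action": np.asarray(action).tolist()})
        observation = np.asarray(reply["observation"], dtype=np.float32)
        return observation, float(reply["reward"]), bool(reply["terminated"]), bool(reply["truncated"]), {}

    def close(self):
        self._stream.close()
        self._sock.close()
```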
Training took place in three environments: Covering, Insert, and Stacking. Each environment consists of an agent, represented as a hand, that can move freely within the bounds of the state space and grab objects.
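Building on the environment sketch above, the following example shows how GAIL from the imitation package could be wired up for one of these environments. The demonstration path, PPO settings, batch sizes, and the UnrealTcpEnv import are assumptions for illustration rather than the exact configuration from the thesis.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

from imitation.algorithms.adversarial.gail import GAIL
from imitation.data import serialize
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm

# Placeholder import: the TCP-backed environment class sketched above.
from unreal_tcp_env import UnrealTcpEnv

venv = DummyVecEnv([lambda: UnrealTcpEnv()])

# Expert demonstrations recorded in VR, loaded from disk (path is a placeholder).
demonstrations = serialize.load("demos/covering_trajectories")

# Generator: a standard PPO learner from Stable-Baselines3.
learner = PPO("MlpPolicy", venv, learning_rate=3e-4, batch_size=64, verbose=0)

# Discriminator: a basic reward network classifying expert vs. generated transitions.
reward_net = BasicRewardNet(
    venv.observation_space,
    venv.action_space,
    normalize_input_layer=RunningNorm,
)

# AIRL from imitation.algorithms.adversarial.airl follows the same interface.
gail_trainer = GAIL(
    demonstrations=demonstrations,
    demo_batch_size=512,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)

# 5,000,000 samples corresponds to the final training runs described below.
gail_trainer.train(total_timesteps=5_000_000)
```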
Results
An expert recorded a training set of 100 trajectories per environment as input for the imitation learning algorithms, along with a testing set of 12 trajectories per environment. Two different systems were used for training. One optimized the hyperparameters using 60 runs and 1,300,000 samples, resulting in 5-6 days of training for each combination of environment and learning algorithm. The best results were then compared against manually tuned hyperparameters. In parallel, the second system performed the final training with 5,000,000 samples, which took 6-8 hours per run, depending on the specific hyperparameters selected.
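Such a search could be set up with Optuna along the following lines; the searched parameters, their ranges, and the dummy evaluation function are illustrative assumptions, not the actual search space used in the thesis.

```python
import optuna


def evaluate_policy_return(learning_rate: float, demo_batch_size: int, ent_coef: float) -> float:
    """Stand-in for training GAIL/AIRL for 1,300,000 samples with the given
    hyperparameters and evaluating the resulting policy's mean return on the
    testing set. Returns a constant only to keep the sketch executable."""
    return 0.0


def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space; the thesis may tune different parameters.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    demo_batch_size = trial.suggest_categorical("demo_batch_size", [128, 256, 512, 1024])
    ent_coef = trial.suggest_float("ent_coef", 0.0, 0.01)
    return evaluate_policy_return(learning_rate, demo_batch_size, ent_coef)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=60)  # 60 runs, as in the search described above
print(study.best_params)
```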
The plots below show how the mean return of a policy, evaluated on the testing set, changes over time compared to the expert (red dotted line). Each plot specifies the hyperparameter set used (automatically or manually tuned), the environment, and the learning algorithm. Furthermore, a single plot combines multiple runs with different degrees of task-reward influence.
A broad search with many different configurations showed a trend in the right direction for Covering and Insert, even though task rewards seemed to have little impact. GAIL, in particular, proved more robust against a poor selection of hyperparameters by at least partially solving the task in the given time, while AIRL had problems learning to grab an object. However, even at the end of training, the mean returns often still change significantly, suggesting that longer training is needed. Stacking was more problematic: the agent did not seem to progress beyond basic movements. Here, even GAIL appears more sensitive to the choice of hyperparameters; compared with the automatically tuned set, the manually tuned hyperparameters resulted in chaotic movement. AIRL suffered from the same problem.
Files
Full version of the master's thesis (English only)
The video below compares expert demonstrations with promising trajectories from the best policies. Keep in mind that the video only shows well-performing trajectories; in their current state, the policies do not reliably reproduce this behavior.
License
This original work is copyright of the University of Bremen.
Any software of this work is covered by the European Union Public Licence v1.2.
To view a copy of this license, visit
eur-lex.europa.eu.
The thesis provided above (as a PDF file) is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International.
Any other assets (3D models, movies, documents, etc.) are covered by the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit
creativecommons.org.
If you use any of the assets or software to produce a publication,
then you must give credit and put a reference in your publication.
If you would like to use our software in proprietary software,
you can obtain an exception from the above license (i.e., dual licensing).
Please contact zach at cs.uni-bremen dot de.