Unsupervised Skill Discovery in Deep Reinforcement Learning

Google AI: DADS – Unsupervised Reinforcement Learning for Skill Discovery

A. Sharma et al. from Google Research and Google Brain published an interesting approach that tackles the difficult problem of specifying a well-designed, task-specific reward function in an unsupervised manner. This avoids the need to manually label "goal" states or to introduce potentially costly instrumentation, e.g. in the form of sensors.

Usually, an agent in a supervised reinforcement learning setting uses an extrinsic reward function specifically designed for a given problem. In contrast, in unsupervised reinforcement learning, an agent utilizes an intrinsic reward function, e.g. procedures mimicking behaviors like curiosity, "to generate its own training signals for acquiring a potentially broad set of task-agnostic behaviors" [Sharma et al. 2020].
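To make the contrast concrete, the following minimal sketch shows one common form of intrinsic reward: a curiosity-style signal derived from the prediction error of the agent's own forward model, so that no externally engineered reward is required. The linear model and all names here are illustrative assumptions, not code from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_model(state, action, weights):
    # Toy linear forward model: predicts the next state from (state, action).
    return weights @ np.concatenate([state, action])

def intrinsic_reward(state, action, next_state, weights):
    # Curiosity-style signal: the prediction error of the agent's own model
    # serves as the training signal instead of an extrinsic, task-specific reward.
    predicted = forward_model(state, action, weights)
    return float(np.sum((predicted - next_state) ** 2))

state = rng.normal(size=4)
action = rng.normal(size=2)
next_state = rng.normal(size=4)
weights = rng.normal(size=(4, 6))

r = intrinsic_reward(state, action, next_state, weights)
print(r)  # squared prediction error, always non-negative
```

Transitions the model predicts poorly yield a high reward, which pushes the agent toward unfamiliar parts of the environment.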

Essentially, this eliminates the effort of designing an extrinsic, task-specific reward function while remaining generalizable to other tasks. Although learning agent-environment interactions without a guiding reward signal is considered difficult, solving this problem in an unsupervised fashion could prove extremely rewarding for many domains outside of classic anthropomatics.

Sharma et al.’s work includes two current research papers, Dynamics-Aware Unsupervised Discovery of Skills, and the more recent Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning, from Google Research and Google Brain.

"The behavior on the left is random and unpredictable, while the behavior on the right demonstrates systematic motion with predictable changes in the environment. Our goal is to learn potentially useful behaviors such as those on the right, without engineered reward functions." – ai.googleblog.com

In their foundational work, Dynamics-Aware Unsupervised Discovery of Skills, they introduce "predictability" as an optimization objective for discovering new skills. This essentially allows building a dynamics model of the environment, which in turn enables the use of planning algorithms. They demonstrate the practicality of the approach in a simulated robotics setup. In their follow-up work, Sharma et al. improve the sample efficiency and demonstrate the practicality of DADS in a real-world scenario, using an off-policy variant to train D’Kitty from ROBEL.
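The planning idea can be sketched as follows: once a skill-dynamics model is learned, candidate skills can be rolled out entirely in the model and scored against a goal, e.g. via simple random-shooting model-predictive control. The linear stand-in model, the function names, and the parameters below are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def predict_next(state, skill, W):
    # Learned skill-dynamics model (toy linear version) stands in
    # for the real environment during planning.
    return W @ np.concatenate([state, skill])

def plan_skill(state, goal, W, n_candidates=64, horizon=5):
    # Random-shooting MPC over skill space: sample candidate skills,
    # roll each out in imagination, keep the one ending closest to the goal.
    best_skill, best_cost = None, np.inf
    for _ in range(n_candidates):
        z = rng.normal(size=2)
        s = state
        for _ in range(horizon):
            s = predict_next(s, z, W)
        cost = float(np.sum((s - goal) ** 2))
        if cost < best_cost:
            best_skill, best_cost = z, cost
    return best_skill, best_cost

state = rng.normal(size=3)
goal = np.zeros(3)
W = 0.1 * rng.normal(size=(3, 5))
z, cost = plan_skill(state, goal, W)
print(cost)  # distance to goal under the best sampled skill
```

Because the rollouts happen in the learned model rather than on the robot, this kind of planning is cheap, which is exactly what a "predictable" skill set enables.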

DADS is built on an intrinsic reward function that encourages curiosity-like behavior in order to discover a skill set that is both "predictable" and "diverse". Since no reward is given by the environment, optimizing skills with respect to diversity allows the agent to discover many potentially useful behaviors.

They utilize a second neural network, the skill-dynamics network, to predict whether a skill is associated with a predictable change in the environment. Better prediction performance of the skill-dynamics network corresponds to more predictable environmental state changes and therefore to a more predictable skill.
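A hedged sketch of how such a reward can combine both objectives: the skill-dynamics model q(s' | s, z) scores how predictable a transition is under the current skill (predictability), and the same transition is compared against scores under alternative skills (diversity). The unit-variance Gaussian model and all names below are illustrative simplifications, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_q(next_state, state, skill, W):
    # Toy skill-dynamics model: Gaussian with mean predicted from (s, z),
    # unit variance, additive constants dropped.
    mean = W @ np.concatenate([state, skill])
    diff = next_state - mean
    return -0.5 * float(diff @ diff)

def skill_reward(state, skill, next_state, other_skills, W):
    # Predictability term: how well the model explains s' under this skill.
    numerator = log_q(next_state, state, skill, W)
    # Diversity term: the same transition scored under alternative skills,
    # a Monte Carlo stand-in for the marginal over skills.
    others = [log_q(next_state, state, z, W) for z in other_skills]
    log_denominator = np.log(np.mean(np.exp(others)) + 1e-12)
    return float(numerator - log_denominator)

state = rng.normal(size=3)
next_state = rng.normal(size=3)
skill = rng.normal(size=2)
other_skills = [rng.normal(size=2) for _ in range(10)]
W = rng.normal(size=(3, 5))

reward = skill_reward(state, skill, next_state, other_skills, W)
print(reward)
```

The reward is high when a transition is well explained by the current skill but poorly explained by others, which pushes skills to be both predictable and mutually distinct.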

Schematic of the DADS model – ai.googleblog.com

Essentially, Sharma et al.’s approach could open up interesting possibilities in less complex areas such as online retail, lead management, and customer experience, if applied accordingly to specific problem domains.

If you are interested in their current research on unsupervised reinforcement learning, have a look at the following resources.

Google AI Blog Post

Dynamics-Aware Unsupervised Discovery of Skills

Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning

GitHub Repo: DADS

Kind regards,

Henrik Hain

This article was first published on:
