Reinforcement Learning and the Asynchronous Advantage Actor-Critic (A3C) Algorithm, Explained

While supervised and unsupervised machine learning are far more widespread practices among enterprises today, reinforcement learning (RL), as a goal-oriented ML technique, is finding its way into a growing range of real-world activities. Gameplay, robotics, dialogue systems, autonomous vehicles, personalization, industrial automation, predictive maintenance, and medicine are among RL’s target areas. In this blog post, we provide a concrete explanation of RL, its applications, and Asynchronous Advantage Actor-Critic (A3C), one of the state-of-the-art algorithms developed by Google’s DeepMind.
Key Terms and Concepts
Reinforcement learning refers to a machine learning technique that enables an agent to learn to interact with an environment (the area outside the agent’s borders) by trial and error, using reward (feedback on its actions and experiences).
The agent is a learning controller that takes actions in the environment and receives feedback in the form of reward.
The environment is the space in which the agent operates and from which it receives everything it needs at a given state. The environment can be static or dynamic, and its changes can be deterministic or stochastic. It is usually formulated as a Markov decision process (MDP), a mathematical framework for modeling decision-making.
The agent seeks to maximize the reward by interacting with the environment rather than by analyzing a fixed set of provided data.
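To make this interaction loop concrete, here is a minimal sketch in Python. The environment class, its methods, and the reward values are made-up assumptions for illustration only, not part of any specific RL library; the agent here simply acts at random and accumulates the reward signal it receives.

```python
import random

class ToyEnvironment:
    """A hypothetical two-state environment used only for illustration."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Taking action 1 from state 0 reaches the goal state and pays a reward of +1.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 1.0, True   # next_state, reward, done
        return self.state, 0.0, False

env = ToyEnvironment()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([0, 1])         # a (very naive) random policy
    state, reward, done = env.step(action) # feedback comes only through the reward
    total_reward += reward
print("Return of the episode:", total_reward)
```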
However, real-world situations often do not convey enough information to make a decision (some context lies outside the currently observed scene). Hence, the Partially Observable Markov Decision Processes (POMDPs) framework comes into play. In a POMDP, the agent needs to take into account a probability distribution over states. In cases where it is impossible to know that distribution, RL researchers use a sequence of multiple observations and actions to represent the current state (i.e., a stack of image frames from a game) to better understand the situation. This makes it possible to use RL methods as if we were dealing with an MDP.
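As a sketch of the “stack of observations” trick, here is one way it could look in Python, assuming image-like observations stored as NumPy arrays; the stack depth of 4 and the 84x84 frame size are arbitrary choices common in Atari-style setups, not requirements.

```python
from collections import deque
import numpy as np

STACK_SIZE = 4                                 # how many past frames stand in for the "state"
frames = deque(maxlen=STACK_SIZE)

def observe(frame):
    """Append the newest observation and return the stacked state."""
    frames.append(frame)
    while len(frames) < STACK_SIZE:            # pad with copies at the start of an episode
        frames.append(frame)
    return np.stack(list(frames), axis=0)      # shape: (STACK_SIZE, H, W)

# Example: an 84x84 grayscale frame
state = observe(np.zeros((84, 84), dtype=np.float32))
print(state.shape)                             # (4, 84, 84)
```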
The reward is a scalar value that the agent receives from the environment. It depends on the environment’s current state (St), the action the agent performs in that state (At), and the following state of the environment (St+1):
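In standard notation, this dependence can be written as:

Rt+1 = R(St, At, St+1)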
Policy (π) stands for an agent’s strategy of behavior at a given time. It is a mapping from states to the actions to be taken to reach the next state. Formally, it is a probability distribution over actions in a given state, i.e., the likelihood of every action in that particular state.
In short, the policy holds the answer to the “How to act?” question for an agent.
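As a minimal sketch of a stochastic policy over a small discrete action space (the states, actions, and probabilities below are invented purely for illustration), the policy assigns each state a distribution over actions, and the agent samples its next action from it:

```python
import random

# Hypothetical policy: for each state, a probability for every available action.
policy = {
    "low_battery":  {"recharge": 0.9, "explore": 0.1},
    "full_battery": {"recharge": 0.1, "explore": 0.9},
}

def act(state):
    """Sample an action according to the policy's distribution for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("low_battery"))   # most likely "recharge"
```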
The state-value function and the action-value function are the ways to assess a policy, as RL aims to learn the best policy.
The value function V holds the answer to the question “How good is the current state?”, namely the expected return starting from the state (S) and following policy (π).
Sebastian Dittert defines the action-value of a state as “the expected return if the agent chooses action A according to a policy π.”
Correspondingly, it answers the question “How good is the current action?”
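Written out in standard textbook form (introducing a discount factor γ and the return Gt, which the definitions above rely on but do not name explicitly), the two functions are:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + …

Vπ(S) = Eπ[Gt | St = S]

Qπ(S, A) = Eπ[Gt | St = S, At = A]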
Thus, the goal of an agent is to find the policy (π) that maximizes the expected return (E[R]). Over many iterations, the agent’s strategy becomes more successful.
One of the most crucial trade-offs in RL is the balance between exploration and exploitation. In short, exploration aims at collecting experience from new, previously unseen regions of the environment. Its potential downsides include risk, the possibility that there is nothing new to learn, and no guarantee of obtaining any useful further information.
By contrast, exploitation updates the model parameters according to already gathered experience. In turn, it does not provide any new data and can be inefficient when rewards are scarce. An ideal approach makes the agent explore the environment just enough to be able to commit to an optimal decision.
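One common (though by no means the only) way to balance the two is an ε-greedy rule: explore with probability ε and exploit the best known action otherwise, often decaying ε over time. A minimal sketch, with made-up action-value estimates:

```python
import random

epsilon = 0.2                                  # probability of exploring
q_values = {"left": 0.1, "right": 0.7}         # hypothetical action-value estimates

def select_action():
    if random.random() < epsilon:
        return random.choice(list(q_values))   # exploration: try something possibly new
    return max(q_values, key=q_values.get)     # exploitation: use gathered experience

print(select_action())
```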