Building Reinforcement Learning Agents that Learn to Collaborate and Compete at the Same Time
OpenAI has been experimenting with techniques that solve one of the major challenges in reinforcement learning applications.
I recently started an AI-focused educational newsletter that already has over 70,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
Coopetition is a neologism used to describe a balanced relationship between cooperation and competition. Coopetition is one of the hallmarks of evolution and one of the best-established dynamics in social environments, as humans come together to achieve a specific goal while remaining competitive towards other objectives. Multi-agent reinforcement learning (MARL) is the branch of deep learning that most closely resembles these social environments, since agents need to interact to accomplish a specific task. Learning to collaborate and compete seems like a natural step in the evolution of MARL. However, most MARL methods train agents in isolation, which constrains the emergence of collaborative behaviors. Recently, a team of researchers from OpenAI published a research paper that proposes a MARL algorithm allowing agents to learn to collaborate and compete with each other in a group environment.
The Coopetition Challenge
MARL environments pose significant challenges for the creation of coopetitive policies between agents. To begin with, multi-agent environments rarely have a stable Nash equilibrium, which forces agents to constantly adapt their policies. As a result, there is an intrinsic pressure for agents to keep getting smarter, but not necessarily more collaborative. Not surprisingly, most MARL models focus on fostering either competition or collaboration, but rarely both.
The simplest approach to learning in multi-agent settings is to use independently learning agents. This is the approach followed by popular reinforcement learning algorithms such as Q-learning or policy gradients, but these methods have proven to be poorly suited for multi-agent environments. The core challenge is non-stationarity: each agent's policy changes as training progresses, so the environment becomes non-stationary from the perspective of any individual agent in a way that is not explainable by changes in the agent's own policy. This creates learning stability problems and prevents the straightforward use of past experience replay, which is crucial for stabilizing deep Q-learning. Policy gradient methods, on the other hand, usually exhibit very high variance when coordination of multiple agents is required.
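To make the non-stationarity point concrete, here is a minimal toy sketch (my own illustration, not from the paper): two independent Q-learners play a repeated 2x2 coordination game, each treating the other as part of the environment. Because both policies keep changing, the reward each agent observes for the same action drifts over training.

```python
# Two *independent* Q-learners in a repeated coordination game.
# Each agent updates as if the reward depended on its own action alone,
# so the other agent's changing policy makes the "environment" non-stationary.
import numpy as np

rng = np.random.default_rng(0)
payoff = np.array([[1.0, 0.0],    # both agents are rewarded only when
                   [0.0, 1.0]])   # they pick the same action

n_actions, alpha, eps = 2, 0.1, 0.2
q = [np.zeros(n_actions), np.zeros(n_actions)]   # one Q-table per agent

for step in range(5000):
    # epsilon-greedy action selection, independently per agent
    acts = [rng.integers(n_actions) if rng.random() < eps else int(np.argmax(q[i]))
            for i in range(2)]
    r = payoff[acts[0], acts[1]]                 # shared reward
    for i in range(2):
        q[i][acts[i]] += alpha * (r - q[i][acts[i]])

print("Agent 0 Q-values:", q[0])
print("Agent 1 Q-values:", q[1])
```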
Multi-Agent Actor Critic
To overcome some of the challenges of traditional reinforcement learning techniques, OpenAI introduced a method that combines centralized training with decentralized execution, allowing the policies to use extra information to ease training. Called MADDPG (because it extends the principles of the DDPG reinforcement learning algorithm to multi-agent settings), the OpenAI algorithm allows agents to learn from their own actions as well as the actions of the other agents in the environment.
In the MADDPG model, each agent is treated as an “actor” that gets advice from a “critic”, which helps the actor decide what actions to reinforce during training. The goal of the critic is to predict the value (i.e., the reward we expect to get in the future) of an action in a particular state, which the agent — the actor — uses to update its policy. By relying on predictions of future rewards, MADDPG gains some stability over time compared to traditional reinforcement learning methods, since the actual rewards can vary considerably in multi-agent environments. To make it feasible to train multiple agents that can act in a globally coordinated way, MADDPG allows each critic to access the observations and actions of all the agents. The following diagram illustrates the basic constructs of the MADDPG model.
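As a complement to that diagram, here is a minimal PyTorch sketch of the two kinds of networks (my own simplification, not OpenAI's reference code): each actor sees only its own observation, while the critic for a given agent scores the joint observations and joint actions of all agents. The dimensions and the deterministic tanh actor are assumptions in the spirit of DDPG-style continuous control.

```python
# Minimal sketch of MADDPG-style networks (assumed dimensions, continuous actions).
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: maps the agent's OWN observation to an action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # DDPG-style deterministic action
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized critic: scores the JOINT observations and actions of all agents."""
    def __init__(self, total_obs_dim, total_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                    # Q_i(x, a_1, ..., a_N)
        )

    def forward(self, all_obs, all_acts):
        return self.net(torch.cat([all_obs, all_acts], dim=-1))

# Example: 3 agents, each with a 10-dim observation and 2-dim action.
n_agents, obs_dim, act_dim = 3, 10, 2
actors  = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
critics = [CentralizedCritic(n_agents * obs_dim, n_agents * act_dim)
           for _ in range(n_agents)]                 # one critic per agent
```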
The key contribution of the MADDPG method is that agents don't need to access the centralized critic at test time; instead, they act based on their own observations combined with their predictions of other agents' behaviors. Since a centralized critic is learned independently for each agent, this approach can also be used to model arbitrary reward structures between agents, including adversarial cases where the rewards are opposing.
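At execution time, the critics are dropped entirely and a rollout only queries each actor with its local observation. The hedged sketch below continues the toy networks above and assumes a hypothetical multi-agent environment with a gym-like interface (env.reset() and env.step() returning one observation and reward per agent).

```python
# Decentralized execution: no critic, no access to other agents' observations.
# `env` is a hypothetical multi-agent environment returning per-agent lists.
import torch

def run_episode(env, actors, max_steps=100):
    obs_n = env.reset()                              # list of per-agent observations
    total = [0.0 for _ in actors]
    for _ in range(max_steps):
        with torch.no_grad():
            acts_n = [actor(torch.as_tensor(o, dtype=torch.float32)).numpy()
                      for actor, o in zip(actors, obs_n)]
        obs_n, rew_n, done_n, _ = env.step(acts_n)   # per-agent rewards
        total = [t + r for t, r in zip(total, rew_n)]
        if all(done_n):
            break
    return total
```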
MADDPG in Action
To see the value of MADDPG, let's take a simple game in which some agents (red dots) try to chase other agents (green dots) before the green agents get to the water (blue dots). Using MADDPG, the red agents learn to team up with one another to chase a single green agent, gaining a higher reward. The green agents, meanwhile, learn to split up: while one is being chased, the other tries to approach the water.
The OpenAI team tested MADDPG across a series of experiments that evaluated both the cooperative and competitive behavior of agents (a short environment-loading sketch follows the list):
a) Cooperative Communication: This task consists of two cooperative agents, a speaker and a listener, who are placed in an environment with three landmarks of differing colors. At each episode, the listener must navigate to a landmark of a particular color, and obtains reward based on its distance to the correct landmark.
b) Predator-Prey: In this variant of the classic predator-prey game, N slower cooperating agents must chase the faster adversary around a randomly generated environment with L large landmarks impeding the way.
c) Cooperative Navigation: In this environment, agents must cooperate through physical actions to reach a set of L landmarks. Agents observe the relative positions of other agents and landmarks, and are collectively rewarded based on the proximity of any agent to each landmark.
d) Physical Deception: Here, N agents cooperate to reach a single target landmark from a total of N landmarks. They are rewarded based on the minimum distance of any agent to the target (so only one agent needs to reach the target landmark).
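These scenarios were released in OpenAI's multiagent particle-environment repository alongside the MADDPG code. The sketch below shows roughly how one of them would be loaded; the make_env helper and the scenario names are taken from that public repo and should be treated as assumptions if the repo has since changed.

```python
# Rough sketch of loading the predator-prey scenario from OpenAI's
# multiagent-particle-envs repo (https://github.com/openai/multiagent-particle-envs).
# Assumes the repo is on PYTHONPATH; the names below come from that repo and may change.
from make_env import make_env

env = make_env('simple_tag')   # predator-prey; 'simple_spread' = cooperative navigation,
                               # 'simple_adversary' = physical deception,
                               # 'simple_speaker_listener' = cooperative communication
print(env.n, "agents")
print("observation spaces:", env.observation_space)
print("action spaces:", env.action_space)
```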
In all scenarios, MADDPG outperformed traditional reinforcement learning methods, as shown in the following chart.
Recently, we have seen coopetition become a more important component of MARL scenarios. The results achieved by OpenAI and DeepMind in multi-player games such as Dota 2 and Quake III, respectively, are clear examples that coopetition is a very achievable goal in MARL environments. Techniques such as MADDPG can help streamline the adoption of coopetitive multi-agent techniques. The OpenAI team open-sourced an initial version of MADDPG on GitHub.