Reinforcement Learning Explained Visually (Part 6): Policy Gradients, step-by-step
Now let’s zoom in on this first phase.
The Policy Network acts as the agent and interacts with the environment for one episode. From the current state, the Policy Network predicts a probability distribution over actions. It then samples an action from this distribution, executes it in the environment, and gets back a reward and the next state. Each such observation is saved as a sample of training data.
The Policy Network remains fixed and does not improve during this phase.
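To make this phase concrete, here is a minimal sketch of the data-collection loop in PyTorch, assuming a classic Gym-style env.reset()/env.step() interface and a policy_net that outputs action probabilities; the names env and policy_net and the API details are illustrative assumptions, not code from the article.

```python
import torch
from torch.distributions import Categorical

def collect_episode(env, policy_net):
    """Run one episode with the current (fixed) policy and record the samples."""
    states, actions, rewards = [], [], []
    state = env.reset()
    done = False
    while not done:
        with torch.no_grad():  # the Policy Network stays fixed during this phase
            # Predict a probability distribution over actions for the current state.
            probs = policy_net(torch.as_tensor(state, dtype=torch.float32))
        # Sample an action from that distribution (rather than taking an argmax).
        action = Categorical(probs=probs).sample().item()
        next_state, reward, done, _ = env.step(action)
        # Save the observation as a sample of training data.
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards
```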
Now we’ll zoom in on the next phase of the flow.
At the end of an episode, the training data is fed to the Policy Network. For each data sample, it takes the current state and the action that was taken and predicts the probability of that action.
The Discounted Return for each action is calculated from the rewards stored in the data samples: the reward for that step plus the discounted rewards of all subsequent steps in the episode.
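Written out with the article's later notation, where Rt is the Discounted Return for the action taken at step t, γ is the discount factor, T is the episode length, and r(t+1) is the reward received after that action (the exact indexing convention is an assumption):

```latex
R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2}\, r_{t+3} + \dots
    = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}
```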
The discounted return and action probability are used to compute the Loss to train the Policy Network. The Policy Network learns to predict better actions until it finds the Optimal Policy.
Policy-based solutions have some advantages over Value-based solutions
A Value-based solution learns a Value function that outputs the Q-value of every action that can be taken from a state. Such solutions struggle with problems where the number of actions is very large. For instance, you could have continuous actions (e.g. a robot can rotate by any angle, including fractional angles, from 0 to 360 degrees, raise its arm or turn its head by any angle, or move at a speed that could be any real number within some range), which results in an infinite number of actions.
On the other hand, a Policy-based solution like Policy Gradient learns a policy function that outputs a probability distribution of actions, and therefore, they can handle problems with a large number of actions.
It also means that Policy Gradient can learn stochastic policies as well as deterministic ones, which matters for problems where there isn’t always a single best action. Since the agent picks an action by sampling from the probability distribution, it can vary its actions each time.
On the other hand, with value-based solutions, the Optimal Policy is derived from the Optimal Q-value by picking the action with the highest Q-value, so they can learn only deterministic policies.
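A small illustrative contrast (the probabilities and Q-values below are made-up numbers, not from the article):

```python
import torch
from torch.distributions import Categorical

probs = torch.tensor([0.7, 0.2, 0.1])      # policy output for some state (made-up numbers)
q_values = torch.tensor([1.3, 0.4, -0.2])  # Q-values for the same state (made-up numbers)

# Value-based: the derived policy is deterministic -- always the highest-Q action.
greedy_action = torch.argmax(q_values).item()              # always 0

# Policy-based: the action is sampled, so it can vary from step to step.
sampled_action = Categorical(probs=probs).sample().item()  # usually 0, sometimes 1 or 2
```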
Finally, since policy-based methods find the policy directly, they are usually more efficient than value-based methods, in terms of training time.
Policy Gradient ensures adequate exploration
In contrast to value-based solutions, which use an implicit ε-greedy policy, Policy Gradient learns its policy as it goes.
At each step, it picks an action by sampling from the predicted probability distribution. As a result, it ends up taking a variety of different actions.
When training starts out, the probability distribution of actions that it outputs is fairly uniform, since it doesn’t yet know which actions are preferred. This encourages exploration, as all actions are equally likely. As training progresses, the probability distribution converges on the best actions, thus maximizing exploitation. If there is only one best action for some state, the distribution converges to a degenerate distribution, resulting in a deterministic policy.
Policy Gradient Loss function
With Q-Learning and DQN, we had a target Q-value with which to compare our predicted Q-value, so it was fairly easy to define a Loss function. However, with Policy Gradient, there is no such target with which to compare its predicted action probability distribution.
All we have available is whether a selected action led to positive or negative rewards. How do we use this to define a Loss function in this case?
Let’s think about the objective that we want the Loss function to achieve. If an action results in a positive reward, we want that action to become more likely (i.e. increase its probability), and if it results in a negative reward, we want that action to become less likely (i.e. decrease its probability).
However, for a positive action, we don’t want to increase its probability all the way to 1, which would make it the only action ever taken, as the probability of all other actions would be reduced to 0. Similarly, for a negative action, we don’t want to reduce its probability to 0, which would mean that the action never gets taken. We want to continue exploring and trying different actions multiple times.
Instead, for a positive action, we want the probability to increase slightly while reducing the probability of all other actions slightly (since they all must sum to 1). This makes that action slightly more likely.
In other words, if we took action a1, and P(a) is the probability of taking action ‘a’, we want to maximize P(a1). Since Loss functions work by minimizing rather than maximizing, we can equivalently use 1 - P(a1) as our Loss. It approaches 0 as P(a1) approaches 1, so it encourages the network to modify its weights to adjust the probability distribution by increasing P(a1).
Similarly for a negative action.
Probabilities are represented as floating-point numbers, and when those numbers get very close to 0, computations with them can run into numerical underflow and precision problems.
To avoid this, we usually use -log(P(a1)), which behaves equivalently to 1 - P(a1): it also approaches 0 as P(a1) approaches 1. But the log does not suffer from these numerical issues, because the log of a probability has a much broader range, (−∞, 0), instead of being squeezed into (0, 1).
Finally, we do not want to assign equal weight to every action. Actions that resulted in more reward during the episode should get a higher weight. Therefore we use the Discounted Return of each action to determine its contribution. So the Loss function is -log(P(a)) * Rt, where Rt is the Discounted Return for action ‘a’. That way, the probability of each action taken is updated in proportion to the Discounted Return resulting from that action.
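As a hedged sketch in PyTorch, the Discounted Returns and this loss could be computed as follows. The discount factor of 0.99 and averaging the loss over the episode (rather than summing it) are illustrative choices, not something specified in the article.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute the Discounted Return Rt for every time-step of an episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):          # accumulate rewards backwards through the episode
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns)

def policy_gradient_loss(action_probs, returns):
    """Loss = -log(P(a)) * Rt, averaged over the episode.

    action_probs: probabilities the network assigned to the actions actually taken.
    returns:      the Discounted Return for each of those actions.
    """
    return (-torch.log(action_probs) * returns).mean()
```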
Detailed Operation
Now let’s look at the detailed operation of the REINFORCE algorithm for Policy Gradients. From some initial state, the Policy Network takes the current state as input and outputs a probability distribution over all actions.
We then pick an action by sampling from that distribution.
The action is fed to the environment which generates the Next state and reward. The Next state is fed back into the Policy Network.
Each sample of observed experience is stored as training data. Note that the Policy Network weights remain fixed during this phase.
We continue doing this until the end of the episode. After that, we move on to the phase that trains the Policy Network, using the collected training data as input. The training batch holds one episode’s worth of samples.
Each state that was visited in the episode is input to the Policy Network.
The Policy Network predicts the probability distribution of all actions from each of those states.
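Putting the pieces together, here is a sketch of one complete REINFORCE update, reusing the collect_episode, discounted_returns and policy_gradient_loss helpers sketched above. The names policy_net and optimizer, and the gather-based lookup of the taken actions’ probabilities, are assumptions made for illustration.

```python
import numpy as np
import torch

def reinforce_update(env, policy_net, optimizer, gamma=0.99):
    # Phase 1: run one episode with the current (fixed) policy and store the samples.
    states, actions, rewards = collect_episode(env, policy_net)

    # Phase 2: feed every visited state back into the Policy Network.
    states_t = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    actions_t = torch.as_tensor(actions)
    probs = policy_net(states_t)                    # distribution over all actions, per state
    taken_probs = probs.gather(1, actions_t.unsqueeze(1)).squeeze(1)  # P(a) of actions taken

    # Loss = -log(P(a)) * Rt, followed by one gradient step on the network weights.
    returns = discounted_returns(rewards, gamma)
    loss = policy_gradient_loss(taken_probs, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```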