Policy Gradient from Scratch

Before going into the policy gradient we need to review the following terms: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted future return from time $t$, or equivalently $G_t = r_t + \gamma G_{t+1}$; $\pi_\theta(a_t \mid s_t)$ is the policy, and note that it is a distribution (we will discuss the deterministic-policy case later, hopefully). We also have a state transition probability $p(s_{t+1} \mid s_t, a_t)$, a state value function $V^\pi(s) = \mathbb{E}[G_t \mid s_t = s]$, and an action value function $Q^\pi(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$.
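
To make the return definition concrete, here is a minimal sketch (the function name `discounted_returns` and the toy reward list are purely illustrative) that computes $G_t$ for every step of a finite episode via the recursion $G_t = r_t + \gamma G_{t+1}$:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=0} gamma^k * r_{t+k} for every step of a finite episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 1.0], gamma=0.9))  # [1.81, 0.9, 1.0]
```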

In reinforcement learning, the objective is defined as

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$$

Then we can expand it into the following integration over all possible trajectories

$$J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$$

where

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Above we see that the expected return is an integration over all possible trajectories of the product of the initial state probability, the policy distribution, and the state transition probabilities, weighted by the return collected along each trajectory.
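
To make this factorization concrete, here is a small sketch on a made-up two-state, two-action MDP (all of the probability tables below are hypothetical) that evaluates $\log p_\theta(\tau)$ as the sum of the initial-state, policy, and transition log probabilities:

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions, only to illustrate the factorization.
p_s0   = np.array([0.7, 0.3])                     # initial state distribution p(s_0)
pi     = np.array([[0.6, 0.4], [0.2, 0.8]])       # policy pi(a|s), rows indexed by state
p_next = np.array([[[0.9, 0.1], [0.3, 0.7]],      # transitions p(s'|s,a), indexed [s][a][s']
                   [[0.5, 0.5], [0.1, 0.9]]])

def traj_log_prob(states, actions):
    """log p(tau) = log p(s_0) + sum_t [ log pi(a_t|s_t) + log p(s_{t+1}|s_t,a_t) ]."""
    logp = np.log(p_s0[states[0]])
    for t in range(len(actions)):
        logp += np.log(pi[states[t], actions[t]])
        logp += np.log(p_next[states[t], actions[t], states[t + 1]])
    return logp

print(traj_log_prob(states=[0, 1, 0], actions=[1, 0]))
```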

The gradient with respect to the policy parameter $\theta$ is

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$

Using the log gradient trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, we get

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$
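
As a quick numerical check of the log gradient (score function) trick, the sketch below estimates $\nabla_\theta\, \mathbb{E}_{x \sim \mathrm{Bernoulli}(\theta)}[x]$, whose true value is exactly $1$, by averaging $f(x)\, \nabla_\theta \log p_\theta(x)$ over samples (the distribution and the numbers are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.3, 200_000
x = rng.random(n) < theta                                  # samples x ~ Bernoulli(theta)
f = x.astype(float)                                        # f(x) = x, so d/dtheta E[f(x)] = 1
score = np.where(x, 1.0 / theta, -1.0 / (1.0 - theta))     # d/dtheta log p(x; theta)
print(np.mean(f * score))                                  # score-function estimate, close to 1.0
```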

Expanding $\log p_\theta(\tau)$ we have

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t)$$

Taking the gradient w.r.t. $\theta$ gives

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

since $p(s_0)$ and $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$.

Then we could write

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$

Note that $R(\tau)$ contains the return from every time step; however, for the expectation at time step $t$, rewards collected before time $t$ do not depend on the action $a_t$ (by the causal, Markov structure of the trajectory), so we can replace $R(\tau)$ with the reward-to-go $G_t$ and write

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

which is the standard REINFORCE formulation. By the law of total expectation, for any function $X$ of the trajectory,

$$\mathbb{E}_{\tau}\big[X\big] = \mathbb{E}_{s_t, a_t}\Big[\mathbb{E}\big[X \mid s_t, a_t\big]\Big]$$

and $G_t$ is a noisier realization of $Q^\pi(s_t, a_t)$ if $Q$ is exact (a conditional expectation always reduces variance). Using the law of total expectation and conditioning on the state and action, we get the following

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\right]$$

which is the policy gradient. Below is a detailed derivation; we look at a specific time step $t$:

$$\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]
= \mathbb{E}_{s_t, a_t}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathbb{E}\big[G_t \mid s_t, a_t\big]\Big]
= \mathbb{E}_{s_t, a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\big]$$

Here the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is a function of $(s_t, a_t)$ only, so it factors out of the inner conditional expectation.
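
Before moving on to variance reduction, here is how the reward-to-go REINFORCE estimator above is commonly implemented: a surrogate loss whose autograd gradient matches the estimator. This is a minimal PyTorch-style sketch; the function name and the assumption that `log_probs` holds $\log \pi_\theta(a_t \mid s_t)$ for one episode are illustrative:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose gradient is the reward-to-go REINFORCE estimator.

    log_probs: 1-D tensor of log pi_theta(a_t|s_t) collected along one episode
    rewards:   1-D tensor of rewards r_t from the same episode
    """
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):            # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    # Returns are treated as constants so gradients flow only through log pi_theta.
    return -(log_probs * returns.detach()).sum()
```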

Note that $G_t$ is a Monte Carlo sample of the return along a single trajectory and therefore has high variance. Now our mission is to find ways to reduce that variance. First, we introduce a baseline $b(s_t)$ that is only state dependent; we now proceed to prove that such a baseline does not bias the policy gradient.

Then

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\right]$$

and it suffices to show that the baseline term vanishes in expectation.

Note that here we use the log trick again:

$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
= \int \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\, da_t
= b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t)\, da_t$$

The equation above becomes

$$b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t)\, da_t = b(s_t)\, \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, da_t = b(s_t)\, \nabla_\theta 1 = 0$$

So as long as the baseline does not depend on the action, we still have an unbiased estimate of the policy gradient.
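
To tie this back to code, below is a minimal sketch of the baselined estimator, assuming a learned value network supplies $b(s_t) = V_\phi(s_t)$ (the function and tensor names are illustrative). The baseline is detached in the policy term, so the gradient stays unbiased exactly as proved above, while the value function itself is typically fit by regression onto the returns:

```python
import torch

def reinforce_with_baseline_loss(log_probs, returns, values):
    """Policy loss with a state-dependent baseline b(s_t) = V(s_t).

    log_probs: log pi_theta(a_t|s_t) for one episode
    returns:   reward-to-go G_t computed from the same episode
    values:    baseline predictions V_phi(s_t), detached in the policy term
    """
    advantages = (returns - values).detach()   # subtracting b(s_t) keeps the gradient unbiased
    policy_loss = -(log_probs * advantages).sum()
    # The baseline itself is usually trained by regressing V_phi(s_t) onto G_t.
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return policy_loss, value_loss
```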