Policy Gradient from Scratch
Before going into the policy gradient we need some understanding/review of the following terms: $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted future return from time $t$; $\pi_\theta(a_t \mid s_t)$ is the policy, and note that it is a distribution (we will discuss the case where the policy is deterministic later, hopefully). We also have a state transition probability $p(s_{t+1} \mid s_t, a_t)$, a state value function $V^\pi(s)$, and an action value function $Q^\pi(s, a)$.
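To make the discounted return $G_t$ concrete, here is a minimal sketch that computes it for every time step of an episode with a single backward pass. The reward list and the value of $\gamma$ are made-up toy numbers.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=t} gamma^(k-t) * r_k for every t via a backward pass."""
    G = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # G_t = r_t + gamma * G_{t+1}
        out[t] = G
    return out

# Toy episode with gamma = 0.5 for easy arithmetic:
# G_2 = 4, G_1 = 2 + 0.5*4 = 4, G_0 = 1 + 0.5*4 = 3
print(discounted_returns([1.0, 2.0, 4.0], gamma=0.5))  # [3.0, 4.0, 4.0]
```

The backward recursion $G_t = r_t + \gamma G_{t+1}$ avoids the $O(T^2)$ cost of summing from scratch at every step.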
In reinforcement learning, the objective is defined as

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^t r_t.$$
Then we can expand it into an integral over all possible trajectories:

$$J(\theta) = \int p_\theta(\tau) \, R(\tau) \, d\tau,$$
where

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t).$$
Above we see that the expected return is an integral over all possible trajectories: each trajectory's collected return is weighted by its probability, which is the product of the initial state probability, the policy's output distribution, and the state transition probabilities.
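The trajectory probability factorization above can be sketched in code. The toy MDP below (`p0`, `pi`, `P` as tabular dictionaries) is entirely made up for illustration; in practice the transition model is unknown, which is exactly why the cancellation in the next steps matters.

```python
import math

def trajectory_log_prob(p0, pi, P, states, actions):
    """log p(tau) = log p0(s_0) + sum_t [log pi(a_t|s_t) + log P(s_{t+1}|s_t,a_t)].
    p0, pi, P are tabular distributions of a hypothetical toy MDP."""
    logp = math.log(p0[states[0]])
    for t, a in enumerate(actions):
        logp += math.log(pi[(states[t], a)])          # policy term
        logp += math.log(P[(states[t], a, states[t + 1])])  # transition term
    return logp

# Two-state toy MDP: deterministic transitions, uniform two-action policy.
p0 = {"s0": 1.0}
pi = {("s0", "a"): 0.5, ("s0", "b"): 0.5, ("s1", "a"): 0.5, ("s1", "b"): 0.5}
P = {("s0", "a", "s1"): 1.0, ("s1", "a", "s0"): 1.0}
lp = trajectory_log_prob(p0, pi, P, ["s0", "s1", "s0"], ["a", "a"])
print(lp)  # log(1 * 0.5 * 1 * 0.5 * 1) = log(0.25)
```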
The gradient with respect to the policy parameters $\theta$ is

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) \, R(\tau) \, d\tau = \int \nabla_\theta p_\theta(\tau) \, R(\tau) \, d\tau.$$
Using the log-gradient trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau)$, this becomes

$$\nabla_\theta J(\theta) = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau) \, R(\tau)\right].$$
Expanding $\log p_\theta(\tau)$ we have

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log p(s_{t+1} \mid s_t, a_t).$$
Taking the gradient w.r.t. $\theta$ gives

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$

since $p(s_0)$ and $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$.
Then we can write

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right].$$
Note that $R(\tau)$ contains every time step's return. However, for the term at time step $t$, rewards collected before time $t$ do not depend on the action $a_t$ (by causality), so we can replace $R(\tau)$ with the reward-to-go $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ and write

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right],$$
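The reward-to-go estimator above can be assembled as follows, as a sketch. Here `grad_log_probs` stands in for the per-step vectors $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which in practice would come from an autodiff framework; the numbers in the usage example are made up.

```python
def reinforce_gradient(grad_log_probs, rewards, gamma=0.99):
    """Single-trajectory REINFORCE estimate with reward-to-go:
    sum_t grad log pi(a_t|s_t) * G_t."""
    T = len(rewards)
    # Backward pass for the reward-to-go G_t at every step.
    G, rtg = 0.0, [0.0] * T
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        rtg[t] = G
    # Accumulate the weighted score-function terms.
    dim = len(grad_log_probs[0])
    grad = [0.0] * dim
    for g, Gt in zip(grad_log_probs, rtg):
        for i in range(dim):
            grad[i] += g[i] * Gt
    return grad

# Two steps, two parameters, gamma = 1: rtg = [2, 1],
# so grad = [1*2 + 0*1, -1*2 + 1*1] = [2.0, -1.0].
print(reinforce_gradient([[1.0, -1.0], [0.0, 1.0]], [1.0, 1.0], gamma=1.0))
```

Averaging this estimate over a batch of trajectories gives the Monte-Carlo approximation of $\nabla_\theta J(\theta)$.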
which is the standard REINFORCE formulation. Note that $G_t$ is a noisier realization of $Q^\pi(s_t, a_t)$: if $Q^\pi$ is exact, then $Q^\pi(s_t, a_t) = \mathbb{E}\left[G_t \mid s_t, a_t\right]$, and replacing a sample by its conditional expectation always reduces variance. Using the law of total expectation and conditioning on the state and action, we get

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q^\pi(s_t, a_t)\right],$$
which is the policy gradient. Below is a detailed derivation, looking at a specific time step $t$:

$$\begin{aligned} \mathbb{E}_\tau\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right] &= \mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \mathbb{E}\left[G_t \mid s_t, a_t\right]\right] \\ &= \mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q^\pi(s_t, a_t)\right]. \end{aligned}$$
Note that $G_t$ is a Monte-Carlo sample from a single trajectory, so the resulting estimator has high variance. Our mission now is to find ways to reduce that variance. First, we introduce a baseline $b(s_t)$ that is only state dependent, and we proceed to prove that such a baseline does not bias the policy gradient.
Then

$$\mathbb{E}_\tau\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t)\right] = \mathbb{E}_{s_t}\left[b(s_t) \, \mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right].$$
Note that here we use the log trick again:

$$\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \int \pi_\theta(a_t \mid s_t) \, \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \, da_t = \nabla_\theta \int \pi_\theta(a_t \mid s_t) \, da_t = \nabla_\theta 1 = 0.$$
The inner expectation above therefore evaluates to zero, so the whole baseline term vanishes in expectation.
So as long as we use a baseline that does not depend on the action, we can subtract it from $G_t$ and still have an unbiased estimate of the policy gradient.
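The two claims above, unbiasedness and variance reduction, can be checked numerically on a toy one-state bandit. This is a hypothetical example: the policy is Bernoulli with $\pi(a{=}1) = p$, for which the standard score-function identity is $\nabla_\theta \log \pi(a) = a - p$ (gradient in the logit), the rewards are made up, and the baseline is the state value $b = V(s)$.

```python
import random

random.seed(0)
p = 0.7                        # pi(a=1); toy Bernoulli policy parameter
r = {0: 1.0, 1: 3.0}           # made-up deterministic reward per action
b = p * r[1] + (1 - p) * r[0]  # baseline = V(s) = expected reward

def estimate(use_baseline, n=100_000):
    """Monte-Carlo estimate of E[(a - p) * (r(a) - b)] and its sample variance."""
    samples = []
    for _ in range(n):
        a = 1 if random.random() < p else 0
        glp = a - p                                   # grad_theta log pi(a)
        adv = r[a] - (b if use_baseline else 0.0)     # return minus baseline
        samples.append(glp * adv)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

m_plain, v_plain = estimate(use_baseline=False)
m_base, v_base = estimate(use_baseline=True)
# True gradient: p(1-p)(r[1]-r[0]) = 0.7*0.3*2 = 0.42.
# Both means converge to 0.42; the baseline shrinks the variance
# (analytically, 0.5376 without baseline vs 0.1344 with it).
print(m_plain, v_plain)
print(m_base, v_base)
```

Both estimators target the same gradient, confirming that the baseline adds no bias while cutting the variance substantially.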