Policy Gradient from Scratch

Before going into the policy gradient we need to review the following terms: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted future return from time $t$, or equivalently $G_t = r_t + \gamma G_{t+1}$; $\pi_\theta(a_t \mid s_t)$ is the policy, and note that it is a distribution (we will discuss the deterministic-policy case later, hopefully). We also have a state transition probability $p(s_{t+1} \mid s_t, a_t)$, a state value function $V^\pi(s) = \mathbb{E}[G_t \mid s_t = s]$, and an action value function $Q^\pi(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$.
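
To make the return definition concrete, here is a minimal sketch (the function name `discounted_returns` and the toy reward list are purely illustrative) that computes $G_t$ for every step of a finite episode via the recursion $G_t = r_t + \gamma G_{t+1}$:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=0} gamma^k * r_{t+k} for every step of a finite episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 1.0], gamma=0.9))  # [1.81, 0.9, 1.0]
```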

In reinforcement learning, the objective is defined as

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$$

Then we can expand it into the following integration over all possible trajectories

$$J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$$

where

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Above we see that the expected return is an integration over all possible trajectories of the product of the initial state probability, the policy distribution, and the state transition probabilities, weighted by the return collected along each trajectory.
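
To make this factorization concrete, here is a small sketch on a made-up two-state, two-action MDP (all of the probability tables below are hypothetical) that evaluates $\log p_\theta(\tau)$ as the sum of the initial-state, policy, and transition log probabilities:

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions, only to illustrate the factorization.
p_s0   = np.array([0.7, 0.3])                     # initial state distribution p(s_0)
pi     = np.array([[0.6, 0.4], [0.2, 0.8]])       # policy pi(a|s), rows indexed by state
p_next = np.array([[[0.9, 0.1], [0.3, 0.7]],      # transitions p(s'|s,a), indexed [s][a][s']
                   [[0.5, 0.5], [0.1, 0.9]]])

def traj_log_prob(states, actions):
    """log p(tau) = log p(s_0) + sum_t [ log pi(a_t|s_t) + log p(s_{t+1}|s_t,a_t) ]."""
    logp = np.log(p_s0[states[0]])
    for t in range(len(actions)):
        logp += np.log(pi[states[t], actions[t]])
        logp += np.log(p_next[states[t], actions[t], states[t + 1]])
    return logp

print(traj_log_prob(states=[0, 1, 0], actions=[1, 0]))
```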

The gradient with respect to the policy parameter $\theta$ is

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$

Using the log gradient trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, we get

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$
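
As a quick numerical check of the log gradient (score function) trick, the sketch below estimates $\nabla_\theta\, \mathbb{E}_{x \sim \mathrm{Bernoulli}(\theta)}[x]$, whose true value is exactly $1$, by averaging $f(x)\, \nabla_\theta \log p_\theta(x)$ over samples (the distribution and the numbers are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.3, 200_000
x = rng.random(n) < theta                                  # samples x ~ Bernoulli(theta)
f = x.astype(float)                                        # f(x) = x, so d/dtheta E[f(x)] = 1
score = np.where(x, 1.0 / theta, -1.0 / (1.0 - theta))     # d/dtheta log p(x; theta)
print(np.mean(f * score))                                  # score-function estimate, close to 1.0
```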

Expanding $\log p_\theta(\tau)$ we have

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t)$$

Taking the gradient w.r.t. $\theta$ gives

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

since $p(s_0)$ and $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$.

Then we could write

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$

Note that $R(\tau)$ contains the return from every time step; however, for the expectation at time step $t$, rewards collected before time $t$ do not depend on the action $a_t$ (by the causal, Markov structure of the trajectory), so we can replace $R(\tau)$ with the reward-to-go $G_t$ and write

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

which is the standard REINFORCE formulation. By the law of total expectation, for any function $X$ of the trajectory,

$$\mathbb{E}_{\tau}\big[X\big] = \mathbb{E}_{s_t, a_t}\Big[\mathbb{E}\big[X \mid s_t, a_t\big]\Big]$$

and $G_t$ is a noisier realization of $Q^\pi(s_t, a_t)$ if $Q$ is exact (a conditional expectation always reduces variance). Using the law of total expectation and conditioning on the state and action, we get the following

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\right]$$

which is the policy gradient. Below is a detailed derivation; we look at a specific time step $t$:

$$\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]
= \mathbb{E}_{s_t, a_t}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathbb{E}\big[G_t \mid s_t, a_t\big]\Big]
= \mathbb{E}_{s_t, a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\big]$$

Here the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is a function of $(s_t, a_t)$ only, so it factors out of the inner conditional expectation.
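
Before moving on to variance reduction, here is how the reward-to-go REINFORCE estimator above is commonly implemented: a surrogate loss whose autograd gradient matches the estimator. This is a minimal PyTorch-style sketch; the function name and the assumption that `log_probs` holds $\log \pi_\theta(a_t \mid s_t)$ for one episode are illustrative:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose gradient is the reward-to-go REINFORCE estimator.

    log_probs: 1-D tensor of log pi_theta(a_t|s_t) collected along one episode
    rewards:   1-D tensor of rewards r_t from the same episode
    """
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):            # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    # Returns are treated as constants so gradients flow only through log pi_theta.
    return -(log_probs * returns.detach()).sum()
```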

Note that $G_t$ is a Monte Carlo sample of the return along a single trajectory and therefore has high variance. Now our mission is to find ways to reduce that variance. First, we introduce a baseline $b(s_t)$ that is only state dependent; we now proceed to prove that such a baseline does not bias the policy gradient.

Then

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\right]$$

and it suffices to show that the baseline term vanishes in expectation.

Note that here we use the log trick again:

$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
= \int \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\, da_t
= b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t)\, da_t$$

The equation above becomes

$$b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t)\, da_t = b(s_t)\, \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, da_t = b(s_t)\, \nabla_\theta 1 = 0$$

So as long as the baseline does not depend on the action, we still have an unbiased estimate of the policy gradient.
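
To tie this back to code, below is a minimal sketch of the baselined estimator, assuming a learned value network supplies $b(s_t) = V_\phi(s_t)$ (the function and tensor names are illustrative). The baseline is detached in the policy term, so the gradient stays unbiased exactly as proved above, while the value function itself is typically fit by regression onto the returns:

```python
import torch

def reinforce_with_baseline_loss(log_probs, returns, values):
    """Policy loss with a state-dependent baseline b(s_t) = V(s_t).

    log_probs: log pi_theta(a_t|s_t) for one episode
    returns:   reward-to-go G_t computed from the same episode
    values:    baseline predictions V_phi(s_t), detached in the policy term
    """
    advantages = (returns - values).detach()   # subtracting b(s_t) keeps the gradient unbiased
    policy_loss = -(log_probs * advantages).sum()
    # The baseline itself is usually trained by regressing V_phi(s_t) onto G_t.
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return policy_loss, value_loss
```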