Introduction to dynamic programming
Dynamic Programming is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving (often recursively) each of those subproblems just once, and storing their solutions using a memory-based data structure (array, map, etc.).
As Jonathan Paulson puts it, Dynamic Programming is just a fancy way to say “remembering stuff to save time later”.
Applied to Reinforcement Learning, Dynamic Programming provides a theoretically grounded framework for:
- Control: Finding a policy $\pi^{*}$ that maximises the cumulative reward
- Prediction: Finding the value $v^{\pi}(s)$ of each state $s$ under a policy $\pi$
In this blogpost, we will overview the fundamental role of Dynamic Programming in Reinforcement Learning. Most notably, we will:
- Provide a proof of the optimal policy theorem, which states that there exists a (not necessarily unique) deterministic policy $\pi^{*}$ that satisfies $v_{\pi}(s) \leq v_{\pi^{*}}(s),\ \forall \pi,\ \forall s$
- Give an algorithm that finds such a policy.
Introduction to planning
TL;DR: Planning with Dynamic Programming requires a transition model $p(s' \mid s, a)$ and a reward function $r(s, a)$. From there, it iteratively computes a value function $q^{\pi}$ (also called the $Q$-value) and its corresponding policy $\pi$, ultimately returning an "optimal" policy $\pi^{*}$. Overall, it is a decision-making process that may rely on a learned model $p_{\theta}(s' \mid s, a)$.
The need for a Markov Decision Process
Mathematically speaking, a planning task consists in estimating $\pi^{*}$ or $v^{*}$ from a Markov Decision Process $\mathcal{X} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $p$ the transition model, $r$ the reward function, and $\gamma$ a discount factor.
If you are unfamiliar with the mathematical framework of reinforcement learning, you can refer to this previous blogpost, or to a more layman-friendly explanation.
Hardness of the task
Recalling that a policy is a mapping $\pi: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}_{+}$, the number of deterministic policies is $|\mathcal{A}|^{|\mathcal{S}|}$. Enumerating them therefore becomes impractical for large state spaces $\mathcal{S}$.
To mitigate this issue, we typically use look-up tables, which provide efficient storage over the state and action spaces. Each subproblem solution is indexed in some way, typically based on the values of its input parameters, so as to facilitate its lookup. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving computation time. This technique of storing solutions to subproblems instead of recomputing them is called memoization.
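To make memoization concrete, here is a small illustrative sketch (a classic example, not from the original post): the Fibonacci sequence, where each subproblem is indexed by its input parameter and computed only once.
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    # Each value of n is solved once; subsequent calls are simple look-ups in the cache
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(100))  # instantaneous, whereas the naive recursion would never finish
The cache plays exactly the role of the look-up table described above.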
Task
Optimal policy theorem: There exists a (not necessarily unique) deterministic policy $\pi^{*}$ that satisfies $v_{\pi}(s) \leq v_{\pi^{*}}(s),\ \forall \pi,\ \forall s$
whose direct consequences are:
$$v^{*}(s) := \max_{\pi} v^{\pi}(s) = v^{\pi^{*}}(s) \quad \text{and} \quad q^{*}(s, a) := \max_{\pi} q^{\pi}(s, a) = q^{\pi^{*}}(s, a)$$
Estimating the Q-value as an expectation over a policy?
The action-value function (or Q-function) is defined as:
$$q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t \geq 0} \gamma^{t}\, r(s_{t}, a_{t}) \,\middle|\, s_{0} = s,\ a_{0} = a\right]$$
The meaning of the expectation over a policy $\mathbb{E}_{\pi}$ is more thoroughly described in this previous blogpost. As for the optimal $Q$-value, it is often denoted as:
$$q^{*}(s, a) = \max_{\pi} q^{\pi}(s, a)$$
Probability distributions being hard to estimate as the dimensionality of the problem increases (i.e. the curse of dimensionality), it becomes challenging to compute this expectation as is. To solve this problem, we turn to the fixed-point theorem.
Fixed-point optimal value function
Let's recall the Bellman equation:
$$v^{\pi}(s) = \mathbb{E}_{\pi}\left[ r(s_{0}, a_{0}) + \gamma\, v^{\pi}(s_{1}) \,\middle|\, s_{0} = s \right]$$
which can be written in a more readable fashion (for a deterministic policy $\pi$):
$$v^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, v^{\pi}(s')$$
This is a linear equation in $v^{\pi}$, and it can be read as a fixed-point equation. Our goal is therefore to derive the optimal value function $v^{\pi^{*}}$ using the fixed-point theorem:
Fixed-point theorem: Let $(E, d)$ be a complete metric space, and $f: E \mapsto E$ a map which satisfies the contraction property:
$$\exists k \in [0, 1),\ \forall (x, y) \in E^{2},\quad d\big(f(x), f(y)\big) \leq k\, d(x, y)$$
Then the sequence $x_{n+1} = f(x_{n})$ converges to a unique $x^{*}$ with $f(x^{*}) = x^{*}$, whatever the starting point $x_{0}$.
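As a quick illustration (a toy example, not from the original post), here is what iterating a contraction looks like in practice; the map $f(x) = 0.5x + 1$ and the tolerance are arbitrary choices.
def fixed_point(f, x0, tol=1e-10, max_iter=1000):
    # Iterate x_{n+1} = f(x_n) until successive iterates barely move
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# f(x) = 0.5 * x + 1 is a contraction with factor k = 0.5:
# its unique fixed point is x* = 2, whatever the starting point
print(fixed_point(lambda x: 0.5 * x + 1, x0=100.0))
The Bellman operators introduced below play the role of $f$, with $\gamma$ as the contraction factor $k$.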
Bellman operators
We show in this section that the Bellman equation satisfies the contraction property (i.e. the assumptions of the fixed-point theorem). To this end, we will consider the problem in terms of operators; an operator being a mapping that acts on elements of a space to produce other elements of the same space. Specifically, we will consider two, described below:
- The linear operator $\mathcal{T}^{\pi}$ takes a present value function $v$ and a state $s$, and outputs the immediate reward plus the $\gamma$-discounted sum of present values $v(s')$, weighted by their corresponding transition probabilities $p(s' \mid s, \pi(s))$:
$$(\mathcal{T}^{\pi} v)(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, v(s')$$
- The non-linear operator $\mathcal{T}^{*}$ basically retrieves the best reachable value from the present values and the current state $s$:
$$(\mathcal{T}^{*} v)(s) = \max_{a} \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s') \right]$$
The Bellman operators are order preserving (monotone):
$$v \leq v' \ \Rightarrow\ \mathcal{T}^{\pi} v \leq \mathcal{T}^{\pi} v' \quad \text{and} \quad \mathcal{T}^{*} v \leq \mathcal{T}^{*} v'$$
In other words, the ordering between value functions is preserved. Using this, together with the fact that the transition probabilities sum to one, we can derive the contraction property easily:
$$\| \mathcal{T}^{\pi} v - \mathcal{T}^{\pi} v' \|_{\infty} \leq \gamma\, \| v - v' \|_{\infty} \quad \text{and} \quad \| \mathcal{T}^{*} v - \mathcal{T}^{*} v' \|_{\infty} \leq \gamma\, \| v - v' \|_{\infty}$$
Using the fixed-point theorem (the space of bounded value functions equipped with the sup norm is complete), we obtain two major results:
- There exists a unique $v^{*}$ such that $\mathcal{T}^{*} v^{*} = v^{*}$: this means that there is a value function upon which we cannot improve.
- $v^{\pi}$ is the unique fixed point of $\mathcal{T}^{\pi}$, i.e. $\mathcal{T}^{\pi} v^{\pi} = v^{\pi}$: iterating $\mathcal{T}^{\pi}$ from any initial value function converges to the value of the policy $\pi$. It can loosely be thought of as an eigenvector of the operator (with eigenvalue 1).
Careful: In our setting, we consider only deterministic policies.
Definition: A policy $\pi'$ is said to act greedily with respect to $\pi$ if:
$$\pi'(s) \in \underset{a}{\text{arg max}}\ q^{\pi}(s, a), \quad \forall s$$
Greedy policy lemma: If $\pi'$ acts greedily with respect to $\pi$, then
- $v_{\pi} \leq v_{\pi'}$
- $v_{\pi'}(s) = v_{\pi}(s) \Leftrightarrow v_{\pi}(s) = v^{*}(s)$
This is a way to design a better policy from an existing policy, which will ultimately be used in an iterative manner.
Proof of the lemma: Recall that $v^{\pi}$ is the fixed point of $\mathcal{T}^{\pi}$, and that by definition of the greedy policy $\mathcal{T}^{\pi'} v^{\pi} = \mathcal{T}^{*} v^{\pi}$.
Then:
$$v^{\pi} = \mathcal{T}^{\pi} v^{\pi} \leq \mathcal{T}^{*} v^{\pi} = \mathcal{T}^{\pi'} v^{\pi}$$
thus, applying the order-preserving operator $\mathcal{T}^{\pi'}$ repeatedly yields
$$v^{\pi} \leq \mathcal{T}^{\pi'} v^{\pi} \leq (\mathcal{T}^{\pi'})^{2} v^{\pi} \leq \dots \leq (\mathcal{T}^{\pi'})^{k} v^{\pi}$$
which in turn implies, letting $k \rightarrow \infty$ (so that $(\mathcal{T}^{\pi'})^{k} v^{\pi} \rightarrow v^{\pi'}$ by the fixed-point theorem),
$$v^{\pi} \leq v^{\pi'}$$
Proof of the optimal policy theorem
The proof consists of exhibiting a candidate policy using the greedy policy lemma.
Let us fix a policy $\pi_{0} = \pi$. At step $n$, we take $\pi_{n+1}$ to be the greedy policy with respect to $\pi_{n}$: $\pi_{n+1}(s) \in \underset{a}{\text{arg max}}\ q^{\pi_{n}}(s, a)$
Using the greedy policy lemma, we obtain a non-decreasing sequence of value functions:
$$v^{\pi_{0}} \leq v^{\pi_{1}} \leq \dots \leq v^{\pi_{n}} \leq \dots$$
As this sequence converges (see the reminder below) and $\mathcal{T}^{*}$ is continuous, there exists $\tilde{v}$ such that:
$$v^{\pi_{n}} \rightarrow \tilde{v} \quad \text{and} \quad \mathcal{T}^{*} \tilde{v} = \tilde{v}$$
Finally, $\tilde{v}$ is the unique fixed point of $\mathcal{T}^{*}$ (i.e. $\tilde{v} = v^{*}$) and $v^{\pi} \leq \tilde{v}$.
Reminder: non-decreasing sequences that are bounded above (here by $r_{\max} / (1 - \gamma)$) are always convergent.
Moreover, the set of deterministic policies $\{\pi_{1}, \dots, \pi_{n}, \dots\}$ is finite; thus $\exists N,\ \forall n \geq N,\ v^{\pi_{n}} = v^{\pi_{N}} = \tilde{v}$. Consequently, there must exist a policy $\pi^{*}$ in this set such that $v^{\pi^{*}} = \tilde{v}$. As $\pi_{0}$ was arbitrarily chosen, the result holds for all $\pi$.
Finally:
$$\forall \pi,\ \forall s, \quad v^{\pi}(s) \leq v^{\pi^{*}}(s) = v^{*}(s)$$
Optimal Bellman equations
One can verify that $(q^{*}, v^{*})$ are optimal if and only if:
$$v^{*}(s) = \max_{a} q^{*}(s, a) \quad \text{and} \quad q^{*}(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^{*}(s')$$
Also, a policy is optimal if and only if it acts greedily with respect to $q^{*}$:
$$\pi^{*}(s) \in \underset{a}{\text{arg max}}\ q^{*}(s, a), \quad \forall s$$
DP algorithms
Prediction
How to obtain $v^{\pi}$ or $v^{*}$? Iterate an operator until convergence:
def iterative_policy_evaluation(pi, max_iter):
    """
    Prediction: evaluate the value function of a fixed policy pi.
    """
    # Initialize v to some suited structure (here QValue(), by lack of a better name :) )
    v = QValue()
    # Apply the linear Bellman operator T^pi repeatedly
    for k in range(max_iter):
        v = linear_bellman_operator(pi, v)
    return v

def value_iteration(max_iter):
    """
    Control: compute the best reachable value v*.
    No policy is needed, since T* maximises over actions at every step.
    """
    # Initialize v to some suited structure (here QValue(), by lack of a better name :) )
    v = QValue()
    # Apply the non-linear Bellman operator T* repeatedly
    for k in range(max_iter):
        v = nonlinear_bellman_operator(v)
    return v
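To make the pseudo-code above concrete, here is a minimal numpy sketch on a hypothetical two-state, two-action MDP; the arrays P and R, the helper q_from_v and the numerical values are illustrative assumptions, not part of the original post.
import numpy as np

# Hypothetical toy MDP: P[a, s, s'] = p(s' | s, a), R[s, a] = r(s, a)
gamma = 0.9
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def q_from_v(v):
    # q(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) * v(s')
    return R + gamma * np.einsum('ast,t->sa', P, v)

def linear_bellman_operator(pi, v):
    # (T^pi v)(s) = q(s, pi(s)) for a deterministic policy pi: S -> A
    return q_from_v(v)[np.arange(len(v)), pi]

def nonlinear_bellman_operator(v):
    # (T* v)(s) = max_a q(s, a)
    return q_from_v(v).max(axis=1)

# Prediction: value of a fixed deterministic policy
pi = np.array([0, 1])
v_pi = np.zeros(2)
for _ in range(200):
    v_pi = linear_bellman_operator(pi, v_pi)

# Control: optimal value, and a greedy (hence optimal) policy extracted from it
v_star = np.zeros(2)
for _ in range(200):
    v_star = nonlinear_bellman_operator(v_star)
pi_star = q_from_v(v_star).argmax(axis=1)

print(v_pi, v_star, pi_star)
With $\gamma < 1$, both loops are exactly the fixed-point iterations described above and converge geometrically.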
Convergence: by the contraction property, the error after $k$ iterations is in $O(\gamma^{k})$ (geometric convergence).
Complexity: the complexity is mostly due to the matrix multiplications of the MDP, i.e. $O(|\mathcal{S}|^{2} |\mathcal{A}|)$ per iteration (and $O(|\mathcal{S}|^{2})$ per iteration for a fixed policy).
Memory: consists of the MDP storage, $O(|\mathcal{S}|^{2} |\mathcal{A}|)$, and the value function storage, $O(|\mathcal{S}|)$.
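To observe this geometric rate in practice, one can reuse the hypothetical toy MDP and the nonlinear_bellman_operator from the numpy sketch above (again an illustrative assumption, not part of the original post).
# Reference value function: run value iteration long enough to converge
v_ref = np.zeros(2)
for _ in range(1000):
    v_ref = nonlinear_bellman_operator(v_ref)

# Watch the sup-norm error shrink roughly by a factor gamma at each iteration
v = np.zeros(2)
for k in range(10):
    v = nonlinear_bellman_operator(v)
    print(k, np.max(np.abs(v - v_ref)))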
Two approaches for performing policy evaluation and value iteration
Synchronous algorithms
def synchronous_policy_evaluation(S, pi, max_iter):
    """
    Synchronous sweep: every state is backed up from the *old* value function,
    and the whole value function is replaced at once at the end of the sweep.
    """
    # Initialize v to some suited structure (here QValue(), by lack of a better name :) )
    v = QValue()
    for k in range(max_iter):
        v_temp = QValue()
        # Back up every state using the old values v
        for s in S:
            v_temp[s] = linear_bellman_operator(pi, v)[s]
        # Swap in the new value function only once the sweep is complete
        v = v_temp
    return v

def synchronous_value_iteration(S, max_iter):
    """
    Same synchronous sweep, but with the non-linear operator T*:
    gives the best reachable value v*.
    """
    v = QValue()
    for k in range(max_iter):
        v_temp = QValue()
        for s in S:
            v_temp[s] = nonlinear_bellman_operator(v)[s]
        v = v_temp
    return v
Asynchronous algorithms
Instead of waiting for a full sweep, states can be updated one at a time, in place, and in any order.
Theorem: The asynchronous algorithm converges to $v^{*}$ provided it keeps visiting every state of the MDP infinitely often.
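For instance, an in-place ("Gauss-Seidel") variant of value iteration updates one state at a time and immediately reuses the new value; this sketch reuses the hypothetical P, R, gamma and q_from_v from the numpy example above.
def asynchronous_value_iteration(n_sweeps=200):
    v = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        # Visit the states in some order; any order works
        # as long as every state keeps being visited
        for s in range(len(v)):
            # In-place update: later backups in the sweep already see the new v[s]
            v[s] = q_from_v(v)[s].max()
    return v

print(asynchronous_value_iteration())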
Control
Control combines prediction and improvement: evaluate the current policy, act greedily with respect to it, and repeat until the policy is stable (policy iteration), exactly as in the construction used in the proof above. An interactive gridworld demo of these algorithms is available at https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html; a minimal sketch is given below.
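Here is a minimal sketch of policy iteration, again reusing the hypothetical toy MDP and the operators defined in the numpy example above (so the names linear_bellman_operator and q_from_v are those illustrative helpers, not an official API).
def policy_iteration(n_states=2, eval_iters=200):
    # Start from an arbitrary deterministic policy
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: iterate T^pi to (approximately) obtain v^pi
        v = np.zeros(n_states)
        for _ in range(eval_iters):
            v = linear_bellman_operator(pi, v)
        # Policy improvement: act greedily with respect to q^pi
        pi_new = q_from_v(v).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            # The policy is stable, hence optimal (greedy policy lemma)
            return pi, v
        pi = pi_new

pi_star, v_star = policy_iteration()
print(pi_star, v_star)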
References
- https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
- https://medium.com/@codingfreak/top-50-dynamic-programming-practice-problems-4208fed71aa3
- https://www.quora.com/How-should-I-explain-dynamic-programming-to-a-4-year-old/answer/Jonathan-Paulson
- https://medium.com/@curiousily/solving-an-mdp-with-q-learning-from-scratch-deep-reinforcement-learning-for-hackers-part-1-45d1d360c120