Predictions in Reinforcement Learning
In a model-free setting, the transition probabilities are unknown and the agent must interact with the environment. An additional challenge appears when the control policy generating the data differs from the one we want to evaluate: this is the off-policy setting, but we will focus on on-policy prediction in this blogpost. We will leverage a simulator and a policy $\pi$, coupled with our knowledge of $S, A, \gamma$, to run episodes and improve our value estimates from the sampled data.
There are mainly four techniques, presented in increasing order of performance:
- MC methods
- Stochastic approximations
- Monte-Carlo learning
- Temporal differences: quite useful for dealing with non-Markovian processes
  - $TD(0)$
  - $TD(\lambda)$
In the previous blogposts, we went through the following topics:
- Dynamic Programming
- Explicitly solving MDPs via $\mathcal{T}^{\pi}$, $\mathcal{T}^{*}$
1. MC methods
The main principle behind MC methods is to run simulations, record the value estimates along the way, and average them afterwards. Mathematically, we draw samples $X_{1}, \dots, X_{n}$ (these are our sampled value estimates) and compute the empirical mean over all samples:

$$\hat{\mu}_{n} = \frac{1}{n}\sum_{i=1}^{n} X_{i}$$

This is valid under the strong assumption that $X_{1}, \dots, X_{n}$ are i.i.d., as the strong law of large numbers then gives:

$$\hat{\mu}_{n} \xrightarrow[n\to\infty]{a.s.} \mathbb{E}[X]$$

where we used the notion of almost sure convergence:

$$Y_{n} \xrightarrow[n\to\infty]{a.s.} Y \iff \mathbb{P}\left(\lim_{n\to\infty} Y_{n} = Y\right) = 1$$
I refer you to this statistics blogpost for more details.
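As a quick illustration, here is a minimal Python sketch (the distribution and the sample sizes are arbitrary choices made for the example) showing the empirical mean approaching the true mean as $n$ grows:

```python
import numpy as np

# Illustration of the law of large numbers on i.i.d. samples.
# True mean is 1.0; the empirical mean drifts towards it as n grows.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)

for n in (10, 1_000, 100_000):
    print(n, samples[:n].mean())
```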
1.1. Stochastic approximation
The goal of stochastic approximations is to compute such an estimate incrementally: update it after every new sample, instead of storing all samples and averaging them at the end.
Some common methods used are incremental means and Stochastic Gradient Descent.
- Incremental mean: simply a running average of the i.i.d. sampled values $X_{1}, \dots, X_{n}$:

  $$\mu_{n} = \mu_{n-1} + \frac{1}{n}\left(X_{n} - \mu_{n-1}\right)$$

  By assuming stationarity and taking the expectation, $\mathbb{E}[\mu_{n}] = \mathbb{E}[X]$: the running average is an unbiased estimate of the mean.
- SGD is performed in the same way, but with a step size $\eta_{n}$ instead of $\frac{1}{n}$:

  $$\mu_{n} = \mu_{n-1} + \eta_{n}\left(X_{n} - \mu_{n-1}\right)$$

  Once again, under stationarity (and a suitable step size, see the next section), $\mu_{n}$ converges to $\mathbb{E}[X]$. A minimal sketch of both updates is given after this list.
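Here is a minimal Python sketch of both updates on synthetic i.i.d. samples (the distribution and the step-size exponent are arbitrary choices made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=1.0, scale=2.0, size=10_000)  # i.i.d. samples, true mean 1.0

# Incremental mean: mu_n = mu_{n-1} + (1/n) * (x_n - mu_{n-1})
mu = 0.0
for n, x in enumerate(xs, start=1):
    mu += (x - mu) / n
print(mu, xs.mean())  # identical up to floating-point error

# SGD-style update with a decaying step size eta_n = 1 / n**0.6
# (a schedule satisfying the Robbins-Monro conditions of the next section)
est = 0.0
for n, x in enumerate(xs, start=1):
    est += (x - est) / n**0.6
print(est)  # also close to the true mean 1.0
```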
1.2. Robbins-Monro conditions
A step size $\eta_{n} \geq 0$ is said to satisfy the Robbins-Monro conditions if:

$$\sum_{n} \eta_{n} = \infty \quad \text{and} \quad \sum_{n} \eta_{n}^{2} < \infty$$

which just means "large but not too large". In practice, the following schedules are used:
- $\eta_{n} = \frac{1}{n^{\alpha}}$ with $\frac{1}{2} < \alpha \leq 1$
1.3. Stochastic approximations of fixed points
In our setting, we want to find the fixed point $V^{*}$ of the Bellman operator $\mathcal{T}$ (refer to the past blogpost) using only a noisy estimate $\hat{\mathcal{T}}V = \mathcal{T}V + b$. More precisely, we want to solve $V = \mathcal{T}V$ with the noisy iterates:

$$V_{n+1} = V_{n} + \eta_{n}\left(\hat{\mathcal{T}}V_{n} - V_{n}\right)$$
To do so, let’s consider the filtration $\mathcal{F}_{n} = \{ V_{0}, \dots, V_{n}, b_{0}, \dots, b_{n-1} \}$ and the following assumptions:
- Noise is ok (zero mean): $\mathbb{E}[b_{n} \mid \mathcal{F}_{n}] = 0$
- Noise is at most $\mathcal{O}(\|V_{n}\|)$: $\exists K: \forall n, \ \mathbb{E}[\|b_{n}\|^{2} \mid \mathcal{F}_{n}] \leq K(1 + \|V_{n}\|^{2})$
- Large but not too large: $\{ \eta_{n} \}_{n\in \mathbb{N}}$ satisfies the Robbins-Monro conditions
With all of the above holding, we have the desired convergence:

$$V_{n} \xrightarrow[n\to\infty]{a.s.} V^{*}$$
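To make this concrete, here is a minimal toy sketch (the operator, noise level, and number of iterations are made up for illustration): a noisy contraction whose fixed point is recovered by the iteration above with a Robbins-Monro step size.

```python
import numpy as np

# Toy sketch: T(v) = 0.9 * v + 1 is a contraction with fixed point v* = 10.
# We only observe the noisy operator T(v) + b_n with zero-mean Gaussian noise,
# and iterate V_{n+1} = V_n + eta_n * (T(V_n) + b_n - V_n) with eta_n = 1/n,
# which satisfies the Robbins-Monro conditions.
rng = np.random.default_rng(0)
v = 0.0
for n in range(1, 50_001):
    noisy_backup = 0.9 * v + 1.0 + rng.normal(scale=1.0)  # \hat{T} V_n = T V_n + b_n
    v += (noisy_backup - v) / n
print(v)  # close to the fixed point 10
```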
2. Monte-Carlo Learning
Monte-Carlo learning works in a sequential, finite-episode setting: an episode is finite when a terminal state is reached, and in that case it is even possible to take $\gamma = 1$. Monte-Carlo estimates have no bias, but they come with high variance.

Any suffix of an episode is itself a sub-episode: from each visited state, the remaining trajectory is a sample of the return from that state, so a single rollout is reused many times, which is a kind of bootstrapping on the sampled data.

- First-visit MC: for each state, use only the sub-episode starting at its first visit (no bias, but variance). A counter $N(s)$ keeps track of the number of visits and is used to average the sampled returns incrementally.
- Every-visit MC: use the sub-episodes starting at every visit of the state (see the sketch below).
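Below is a minimal sketch of first-visit MC prediction; the `(state, reward)` episode format and the function name are assumptions made for the example, not part of any particular library.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte-Carlo prediction.

    `episodes` is a list of episodes, each a list of (state, reward) pairs
    collected under the fixed policy pi.
    """
    V = defaultdict(float)   # value estimates V(s)
    N = defaultdict(int)     # visit counters used to average the returns

    for episode in episodes:
        # Returns computed backwards: G_k = r_k + gamma * G_{k+1}
        G, returns = 0.0, [0.0] * len(episode)
        for k in reversed(range(len(episode))):
            G = episode[k][1] + gamma * G
            returns[k] = G
        seen = set()
        for k, (s, _) in enumerate(episode):
            if s in seen:            # first-visit: ignore later visits of s
                continue
            seen.add(s)
            N[s] += 1
            V[s] += (returns[k] - V[s]) / N[s]   # incremental mean of the returns
    return dict(V)
```

Dropping the `seen` check turns this sketch into every-visit MC.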
3. Temporal difference learning
3.1. TD(0)
We move from episode $N$ to $N+1$ as follows. At each episode $N$, we build $V_{N}$; at step $k$ of the episode, we compute the TD error:

$$\delta_{k} = r_{k} + \gamma V_{N}(S_{k+1}) - V_{N}(S_{k})$$

where $V_{N}(S_{k+1})$ is a biased estimate of the future reward.
Define the noisy backup

$$\hat{\mathcal{T}}^{\pi}V_{N}(S_{k}) = r_{k} + \gamma V_{N}(S_{k+1}),$$

then the TD(0) update

$$V_{N+1}(S_{k}) = V_{N}(S_{k}) + \eta_{N}\,\delta_{k}$$

is a stochastic approximation of the fixed point of $\mathcal{T}^{\pi}$, so the convergence result of Section 1.3 applies.
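A minimal sketch of tabular TD(0) prediction; `env` and `policy` are hypothetical placeholders (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`), and a constant step size is used instead of a decaying one for simplicity.

```python
from collections import defaultdict

def td0(env, policy, num_episodes, gamma=0.99, eta=0.1):
    """TD(0) prediction sketch with a constant step size."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])  # bootstrapped target
            V[s] += eta * (target - V[s])                      # V(s) += eta * TD error
            s = s_next
    return dict(V)
```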
3.2. $TD(\lambda)$
| | MC | TD(0) |
|---|---|---|
| Bias | unbiased | biased (bootstraps on $V_{N}$) |
| Variance | high | lower |
| Updates | needs complete episodes | online, after every step |
Can we get a trade-off between the two? Yes! It is $TD(\lambda)$, defined through the operator:
$$\mathcal{T}_{\lambda}^{\pi} = (1- \lambda) \sum_{m\geq 0}\lambda^{m} (\mathcal{T}^{\pi})^{m+1}$$
Stochastic approximation theorem: this algorithm converges to the fixed point of $\mathcal{T}_{\lambda}^{\pi}$, namely $V^{\pi}$ (the proof relies on the stochastic approximation results of Section 1.3).
Implementation with an eligibility trace:
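Here is a minimal sketch of what a backward-view implementation with accumulating eligibility traces could look like, using the same hypothetical `env`/`policy` interface as the TD(0) sketch above.

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, gamma=0.99, lam=0.9, eta=0.1):
    """Backward-view TD(lambda) prediction with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)            # eligibility trace per state
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error
            E[s] += 1.0                   # accumulating trace for the current state
            for state in list(E):
                V[state] += eta * delta * E[state]  # credit recently visited states
                E[state] *= gamma * lam             # decay the trace
            s = s_next
    return dict(V)
```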