Unconstrained optimization
Main assumption: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ is $\mathcal{C}^{1}$ or $\mathcal{C}^{2}$.
Theorem (Existence of a minimizer): Let $\mathcal{U}$ be a closed subset of $\mathbb{R}^{n}$ and $f: \mathcal{U} \subset \mathbb{R}^{n} \rightarrow \mathbb{R}$ a continuous function.
- If $\mathcal{U}$ is bounded,
- or if $\lim_{\mid\mid x \mid\mid \rightarrow \infty,\ x \in \mathcal{U}} f(x) = + \infty$ (coercivity),
then $f$ attains its minimum on $\mathcal{U}$; in particular, a local minimizer exists.
Searching for a minimum
Line search strategy
To update an iterative algorithm and move from $x_{k}$ to $x_{k+1}$, we use the available information and:
- Choose a direction $d_{k}$.
- Solve (approximately) the 1D minimization problem $\min_{t > 0} f(x_{k} + t d_{k})$, which yields a step $t_{k}$.
- Set $x_{k+1} = x_{k} + t_{k} d_{k}$.
Feasible direction
$d$ is a feasible direction at $x \in \mathcal{D}$ if there exists $\eta > 0$ such that $x + t d \in \mathcal{D}$ for all $t \in [0, \eta]$.
Descent direction
- $d$ is a descent direction of $f$ at $x \in \mathcal{D}$ if there exists $\eta > 0$ such that $f(x + t d) < f(x)$ for all $t \in (0, \eta]$.
- First order Taylor expansion: $f(x + t d) = f(x) + t\, d^{T} \nabla f(x) + o(t)$.
- If $d^{T}\nabla f(x) < 0$, then $d$ is a descent direction (such directions form an open half space).
Level sets: $\mathcal{L}_{z} = \{ x \in \mathbb{R}^{n}: f(x) = z \}$
Derivatives
First order Taylor expansion: $f(x + h) = f(x) + h^{T} \nabla f(x) + o(\mid\mid h \mid\mid)$
Second order Taylor expansion: $f(x + h) = f(x) + h^{T} \nabla f(x) + \frac{1}{2} h^{T} H(x) h + o(\mid\mid h \mid\mid^{2})$, where $H(x)$ is the Hessian of $f$ at $x$.
Directional derivative in the direction $d$: $f'(x; d) = \lim_{t \rightarrow 0^{+}} \frac{f(x + t d) - f(x)}{t} = d^{T} \nabla f(x)$
If $d$ is tangent to the level set at $x$, then $f'(x; d) = d^{T} \nabla f(x) = 0$: $f$ is constant along the level set, so its first-order variation in any tangent direction vanishes (equivalently, $\nabla f(x)$ is orthogonal to the level set), as checked numerically below.
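A quick numerical sanity check of these formulas; the function $f(x) = x_{1}^{2} + 2x_{2}^{2}$, the point and the directions are my own illustrative choices, not from the notes:

```python
import numpy as np

# Toy example: f(x) = x1^2 + 2*x2^2, with gradient (2*x1, 4*x2).
f = lambda x: x[0]**2 + 2 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 4 * x[1]])

x = np.array([1.0, 1.0])
d = np.array([0.5, -0.3])

# Directional derivative: finite-difference estimate vs. d^T grad f(x).
t = 1e-6
print((f(x + t * d) - f(x)) / t, d @ grad_f(x))   # both ≈ -0.2

# A direction tangent to the level set through x is orthogonal to grad f(x),
# so the directional derivative along it is 0: f does not vary (to first order)
# along its own level set.
g = grad_f(x)
d_tan = np.array([-g[1], g[0]])   # rotate grad f(x) by 90 degrees
print(d_tan @ g)                  # 0.0
```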
Necessary condition for a local minimizer
Problem: Find $x^{*} \in \text{arg min}_{x \in \mathbb{R}^{n}} f(x)$
Theorem: If $f$ admits a local minimum at $x^{*}$, then $\nabla f(x^{*}) = 0.$
Remark. The condition is necessary but not sufficient (e.g. $f(x) = x^{3}$ satisfies $f'(0) = 0$ but has no local minimum at $0$).
Sufficient condition for a local minimizer
Definition (Positive definiteness): $H(x^{*}) > 0 \Leftrightarrow \forall d \neq 0, \ d^{T}H(x^{*})d > 0$
Theorem: If $\nabla f(x^{*}) = 0$ and $H(x^{*}) > 0$, then $f$ admits a local minimum at $x^{*}$.
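These two conditions are easy to check numerically. A minimal sketch on a toy quadratic; the function below is my own example, not from the notes:

```python
import numpy as np

# Toy example: f(x, y) = x^2 + x*y + y^2 - 2*x.
grad = lambda v: np.array([2 * v[0] + v[1] - 2, v[0] + 2 * v[1]])
hessian = lambda v: np.array([[2.0, 1.0], [1.0, 2.0]])   # constant for a quadratic

# Candidate x*: solve grad f = 0 (a linear system for this quadratic).
x_star = np.linalg.solve(np.array([[2.0, 1.0], [1.0, 2.0]]), np.array([2.0, 0.0]))

print(grad(x_star))                          # ≈ [0, 0]: necessary condition holds
print(np.linalg.eigvalsh(hessian(x_star)))   # all eigenvalues > 0: H(x*) > 0, so x* is a local minimum
```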
Descent algorithms
Blueprint
The basic idea is to:
- Choose $x_{0}$, and then for $k \geq 0$:
  - Unimodal criterion: any value of $x_{0}$ should work (theoretically…).
- Find a descent direction $d_{k}$.
- Find a step $t_{k} > 0$:
  - Constant step: $t_{k} = t, \ \forall k$.
  - 1D minimisation: $t_{k} = \text{arg min}_{t>0} f(x_{k} + td_{k})$.
- Compute $x_{k+1} = x_{k} + t_{k}d_{k}$.
As for the descent, we require: $\forall k, \ f(x_{k+1}) < f(x_{k})$.
Furthermore, we impose stopping conditions to avoid “getting stuck” (a sketch of the resulting loop is given after the list):
- $f(x_{k}) - f(x_{k+1}) < \epsilon_{1}$
- $\mid\mid x_{k} - x_{k+1}\mid\mid < \epsilon_{2}$
- $\mid\mid \nabla f(x_{k})\mid\mid < \epsilon_{3}$
- $k = K_{max}$
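A minimal sketch of this blueprint with the four stopping tests above; the callables `direction` and `step` stand for whatever rules are chosen later, and their names and the default tolerances are mine, not from the notes:

```python
import numpy as np

def descent(f, grad_f, x0, direction, step,
            eps1=1e-10, eps2=1e-10, eps3=1e-6, k_max=1000):
    """Generic descent loop: x_{k+1} = x_k + t_k d_k with the four stopping tests."""
    x = np.asarray(x0, dtype=float)
    for k in range(k_max):                             # stop at k = K_max
        g = grad_f(x)
        if np.linalg.norm(g) < eps3:                   # small gradient
            break
        d = direction(x, g)                            # e.g. d = -g
        t = step(f, grad_f, x, d)                      # e.g. constant step or line search
        x_new = x + t * d
        if (f(x) - f(x_new) < eps1                     # small decrease of f
                or np.linalg.norm(x_new - x) < eps2):  # small move
            x = x_new
            break
        x = x_new
    return x

# Example use: steepest descent with a constant step on f(x) = ||x||^2.
x_min = descent(lambda x: x @ x, lambda x: 2 * x, np.array([3.0, -1.0]),
                direction=lambda x, g: -g, step=lambda f, gf, x, d: 0.1)
print(x_min)   # ≈ [0, 0]
```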
Gradient based algorithms
Simply choose $d_{k} := -\nabla f(x_{k})$ (the steepest-descent direction).
If the step is chosen by exact line search, $t_{k} = \text{arg min}_{t>0} f(x_{k} + t d_{k})$,
then $\nabla f(x_{k})$ and $\nabla f(x_{k+1})$ are orthogonal: the iterates zig-zag, which leads to slow (linear) convergence (illustrated numerically below).
Typically, whatever the choice of descent direction, we impose that $d_{k}^{T}\nabla f(x_{k}) < 0$.
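A small illustration of the orthogonality of successive gradients, on a quadratic where the exact line-search step has a closed form; the matrix $A$ and vector $b$ are my own toy example:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.zeros(2)
g_prev = None
for k in range(8):
    g = A @ x - b                     # gradient at x_k
    if g @ g < 1e-20:
        break
    d = -g                            # steepest-descent direction
    t = (g @ g) / (g @ A @ g)         # exact minimizer of t -> f(x_k + t d)
    if g_prev is not None:
        print(k, g @ g_prev)          # ≈ 0: grad f(x_k) is orthogonal to grad f(x_{k-1})
    x = x + t * d
    g_prev = g
```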
Line minimization
How to compute the step $t_{k} = \text{arg min}_{t>0} f(x_{k} + td_{k})$, exactly or approximately?
- Steepest descent: exact minimization.
- Constant step $t_{k} = t$:
  - risk of small steps;
  - does not guarantee that $f(x_{k} + td_{k}) < f(x_{k}) \ \forall k$;
  - readjust $t \leftarrow t/2$ when the decrease test fails.
- Dichotomy, golden section (a sketch follows this list).
- Quadratic approximation of $\phi: t \mapsto f(x_{k} + td_{k})$.
- Quadratic or cubic approximation of $\phi$ from the knowledge of $\phi'(0)$.
- Armijo and Wolfe's rules: ensure that $t$ is “acceptable” (sufficient decrease of $f$, sufficiently large step $t$).
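As an illustration of the derivative-free bracketing approach, here is a golden-section search sketch (a standard textbook method; the interval, tolerance and test function are my own choices):

```python
import math

def golden_section(phi, a, b, tol=1e-6):
    """Shrink [a, b] by the golden ratio until it localizes a minimizer of a unimodal phi."""
    inv_phi = (math.sqrt(5) - 1) / 2            # ≈ 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):                     # minimizer lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                                   # minimizer lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

# Example: phi(t) = (t - 2)^2 on [0, 5]; returns ≈ 2.
print(golden_section(lambda t: (t - 2.0)**2, 0.0, 5.0))
```

For brevity this version re-evaluates $\phi$ at both interior points each iteration; a more careful implementation reuses one of the two evaluations.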
Step selection: Armijo’s rule
Choose $t$ such that $f(x_{k} + td_{k}) \leq f(x_{k}) + c_{1}\, t\, f'(x_{k}; d_{k})$ with $0 < c_{1} < 1$.
Let $\phi(t) := f(x_{k} + t d_{k})$; then the rule rewrites as $\phi(t) \leq \phi(0) + c_{1}\, t\, \phi'(0)$,
with $\phi'(0) = \nabla f(x_{k})^{T} d_{k}$.
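A backtracking implementation of this rule (a common way to enforce Armijo's condition; the default values of $t_{0}$, $c_{1}$ and the shrinking factor are conventional choices, not prescribed by the notes):

```python
import numpy as np

def armijo_backtracking(f, grad_f, x, d, t0=1.0, c1=1e-4, shrink=0.5, max_tries=50):
    """Shrink t until f(x + t d) <= f(x) + c1 * t * phi'(0), with phi'(0) = grad f(x)^T d."""
    fx = f(x)
    slope = grad_f(x) @ d          # phi'(0); must be < 0 for a descent direction
    t = t0
    for _ in range(max_tries):
        if f(x + t * d) <= fx + c1 * t * slope:   # Armijo (sufficient decrease) condition
            return t
        t *= shrink                               # step too long: shrink and retry
    return t

# Example: one Armijo step of steepest descent on f(x) = ||x||^2 from x = (3, -1).
x = np.array([3.0, -1.0])
d = -2 * x                                        # -grad f(x)
print(armijo_backtracking(lambda v: v @ v, lambda v: 2 * v, x, d))   # 0.5
```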
Step selection: Wolfe’s rule
Choose $t$ such that: $\phi(t) \leq \phi(0) + c_{1}\, t\, \phi'(0)$ (sufficient decrease, as in Armijo's rule) and $\phi'(t) \geq c_{2}\, \phi'(0)$ (curvature condition, ruling out steps that are too small), with $0 < c_{1} < c_{2} < 1$.
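In practice one rarely codes a Wolfe line search by hand; SciPy ships one. A hedged usage sketch, where the quadratic below is my own toy example and `scipy.optimize.line_search` returns a step satisfying the strong Wolfe conditions (or `None` if it fails):

```python
import numpy as np
from scipy.optimize import line_search

# Toy quadratic: f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

x = np.zeros(2)
d = -grad_f(x)                                   # a descent direction
t, *_ = line_search(f, grad_f, x, d, c1=1e-4, c2=0.9)
print(t)                                         # a step length accepted by the Wolfe rules
```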
Convergence of Gradient Descent
Gradient descent converges under very mild assumptions. For example, it is possible to prove a convergence rate with only a smoothness assumption (not even convexity). We will assume $f$ is differentiable, bounded below by $f^{*} = \inf f$, and that its gradient is $L$-Lipschitz.
Then, for the constant step $t = 1/L$, $\min_{0 \leq i \leq k} \mid\mid \nabla f(x_{i}) \mid\mid^{2} \leq \frac{2L\,(f(x_{0}) - f^{*})}{k+1}$: the smallest gradient norm decreases like $O(1/\sqrt{k})$.
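A sketch of the argument, under the stated assumptions and with the constant step $t = 1/L$: the $L$-Lipschitz gradient gives the descent lemma $f(x_{i+1}) \leq f(x_{i}) - \frac{1}{2L} \mid\mid \nabla f(x_{i}) \mid\mid^{2}$; summing this inequality for $i = 0, \dots, k$ telescopes into $\sum_{i=0}^{k} \mid\mid \nabla f(x_{i}) \mid\mid^{2} \leq 2L\,(f(x_{0}) - f^{*})$, and bounding the minimum of the terms by their average gives the rate above.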
Conjugate gradient algorithm
The conjugate gradient algorithm:
- Choose $x_{0}$.
- For $k \geq 0$, do $x_{k+1} \leftarrow x_{k} - t_{k} \nabla f(x_{k}) + \beta_{k}(x_{k} - x_{k-1})$ (with $\beta_{0} = 0$, i.e. a plain gradient step at $k = 0$).
- The convergence rate is much faster than that of plain gradient descent.
For $k \geq 1$, classical choices of $\beta_{k}$ are:
Fletcher-Reeves: $\beta_{k}^{FR} = \frac{\mid\mid \nabla f(x_{k}) \mid\mid^{2}}{\mid\mid \nabla f(x_{k-1}) \mid\mid^{2}}$
Polak-Ribière: $\beta_{k}^{PR} = \frac{[\nabla f(x_{k})]^{T}(\nabla f(x_{k}) - \nabla f(x_{k-1}))}{\mid\mid \nabla f(x_{k-1}) \mid\mid^{2}}$
Subspace optimization (memory gradient).
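A sketch of nonlinear conjugate gradient in the usual direction form $d_{k} = -\nabla f(x_{k}) + \beta_{k} d_{k-1}$ (equivalent, up to a rescaling of $\beta_{k}$, to the $x_{k} - x_{k-1}$ form above); the `step` argument can be any line-search rule, e.g. the Armijo backtracking sketched earlier, and the toy quadratic in the example is my own:

```python
import numpy as np

def nonlinear_cg(f, grad_f, x0, step, rule="FR", k_max=500, tol=1e-8):
    """Nonlinear conjugate gradient with Fletcher-Reeves ("FR") or Polak-Ribiere ("PR") beta."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    d = -g                                       # first iteration: plain gradient step
    for k in range(k_max):
        if np.linalg.norm(g) < tol:
            break
        t = step(f, grad_f, x, d)                # line search along d
        x = x + t * d
        g_new = grad_f(x)
        if rule == "FR":
            beta = (g_new @ g_new) / (g @ g)           # Fletcher-Reeves
        else:
            beta = g_new @ (g_new - g) / (g @ g)       # Polak-Ribiere (often clipped at 0 in practice)
        d = -g_new + beta * d
        g = g_new
    return x

# Example use on a quadratic, with exact line search along d (closed form for quadratics):
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, -1.0])
exact = lambda f, gf, x, d: -(gf(x) @ d) / (d @ A @ d)
x_min = nonlinear_cg(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b,
                     np.zeros(2), step=exact)
print(x_min, np.linalg.solve(A, b))   # both ≈ the exact minimizer A^{-1} b
```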