Unconstrained optimization


Main assumption: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ is $\mathcal{C}^{1}$ or $\mathcal{C}^{2}$.

Theorem (Existence of a local minimizer): Given $\mathcal{U}$ a closed subset of $\mathbb{R}^{n}$ and a continuous function $f: \mathcal{U} \subset \mathbb{R}^{n} \rightarrow \mathbb{R}$

  • If $\mathcal{U}$ is bounded
  • Or if $\lim_{\mid\mid x \mid\mid \rightarrow \infty,\ x \in \mathcal{U}} f(x) = + \infty$

then $f$ admits a local minimum on $\mathcal{U}$.

Searching for a minimum

Line search strategy

To update an iterative algorithm and move from $x_{k}$ to $x_{k+1}$, we use the available information and:

  1. Choose a direction $d_{k}$.
  2. Solve (approximately) the 1D minimization problem: $t_{k} = \text{arg min}_{t>0} f(x_{k} + td_{k})$
  3. Set $x_{k+1} = x_{k} + t_{k}d_{k}$

Feasible direction

$d$ is a feasible direction at $x \in \mathcal{D}$ if there exists $\eta > 0$ such that $x + td \in \mathcal{D}$ for all $t \in (0, \eta]$.

Descent direction

  • $d$ is a descent direction of $f$ at $x \in \mathcal{D}$ if there exists $\eta > 0$ such that $f(x + td) < f(x)$ for all $t \in (0, \eta]$.
  • First order Taylor expansion: $f(x + td) = f(x) + t\, d^{T}\nabla f(x) + o(t)$
  • $d$ is a descent direction $\Leftrightarrow d^{T}\nabla f(x) < 0$ (half space of descent directions)

Level sets: $\mathcal{L}_{z} = \{ x \in \mathbb{R}^{n}: f(x) = z \}$

Derivatives

First order Taylor expansion: $f(x + h) = f(x) + \nabla f(x)^{T} h + o(\mid\mid h \mid\mid)$

Second order Taylor expansion: $f(x + h) = f(x) + \nabla f(x)^{T} h + \frac{1}{2} h^{T} H(x) h + o(\mid\mid h \mid\mid^{2})$, where $H(x) = \nabla^{2} f(x)$ is the Hessian.

Directional derivative in the direction $d$: $f'(x; d) = \lim_{t \rightarrow 0^{+}} \frac{f(x + td) - f(x)}{t} = d^{T} \nabla f(x)$

If $d$ is tangent to the level set at $x$, then $f'(x; d) = d^{T} \nabla f(x) = 0$: moving along the level set leaves $f$ constant to first order, so the rate of change in a tangent direction vanishes. In particular, $\nabla f(x)$ is orthogonal to the level set through $x$.
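As a quick numerical illustration (a minimal Python sketch; the function `f` below is a made-up example, not from the notes), the directional derivative $d^{T}\nabla f(x)$ can be checked against the finite-difference quotient that defines it:

```python
import numpy as np

# Hypothetical smooth function and its gradient (illustrative, not from the notes).
def f(x):
    return 0.5 * x @ x + np.sin(x[0])

def grad_f(x):
    g = x.copy()
    g[0] += np.cos(x[0])
    return g

x = np.array([1.0, 2.0])
d = np.array([0.3, -0.4])

# Directional derivative f'(x; d) = d^T grad f(x)
analytic = d @ grad_f(x)

# Finite-difference approximation of lim_{t->0+} (f(x + t d) - f(x)) / t
t = 1e-6
numeric = (f(x + t * d) - f(x)) / t

print(analytic, numeric)  # the two values should agree to ~6 digits
```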

Necessary condition for a local minimizer

Problem: Find $x^{*} \in \text{arg min}_{x \in \mathbb{R}^{n}} f(x)$

Theorem: If $f$ admits a local minimum at $x^{*}$, then $\nabla f(x^{*}) = 0.$

Remark. The condition is necessary but not sufficient: for example, $f(x) = x^{3}$ satisfies $f'(0) = 0$ but has no local minimum at $0$.

Sufficient condition for a local minimizer

Definition (Positive definiteness): $H(x^{*}) > 0 \Leftrightarrow \forall d \neq 0, \ d^{T}H(x^{*})d > 0$

Theorem: If $\nabla f(x^{*}) = 0$ and $H(x^{*}) > 0$, then $f$ admits a local minimum at $x^{*}$.
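As a sanity check, these two conditions can be verified numerically (a minimal sketch on a made-up quadratic; `grad_f`, `hess_f`, and the tolerance are illustrative choices):

```python
import numpy as np

# Made-up C^2 function: f(x, y) = (x - 1)^2 + 2*(y + 3)^2, minimized at (1, -3).
def grad_f(v):
    x, y = v
    return np.array([2.0 * (x - 1.0), 4.0 * (y + 3.0)])

def hess_f(v):
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x_star = np.array([1.0, -3.0])

# Necessary condition: the gradient is (numerically) zero.
assert np.linalg.norm(grad_f(x_star)) < 1e-10

# Sufficient condition: H(x*) > 0, i.e. all eigenvalues strictly positive.
eigvals = np.linalg.eigvalsh(hess_f(x_star))
print(eigvals, (eigvals > 0).all())  # [2. 4.] True -> local minimum
```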

Descent algorithms

Blueprint

The basic idea is to:

  1. Choose $x_{0}$, and then for $k \geq 0$:
    • Unimodal criterion: any value of $x_{0}$ should work (theoretically…)
  2. Find a descent direction $d_{k}$.
  3. Find a step $t_{k} > 0$:
    • Constant step: $t_{k} = t, \ \forall k$.
    • 1D minimisation: $t_{k} = \text{arg min}_{t>0} f(x_{k} + td_{k})$.
  4. Compute $x_{k+1} = x_{k} + t_{k}d_{k}$.

To ensure descent, we require $f(x_{k+1}) < f(x_{k})$ for all $k$.

Furthermore, we impose stopping conditions to avoid “getting stuck”:

  • $f(x_{k}) - f(x_{k+1}) < \epsilon_{1}$
  • $\mid\mid x_{k} - x_{k+1}\mid\mid < \epsilon_{2}$
  • $\mid\mid \nabla f(x_{k})\mid\mid < \epsilon_{3}$
  • $k = K_{max}$
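A minimal Python sketch of this blueprint, using the gradient direction with a constant step and the stopping tests listed above (the tolerances and the test function are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def descent(f, grad_f, x0, t=0.1, eps1=1e-10, eps2=1e-10, eps3=1e-8, k_max=10_000):
    """Generic descent loop: direction d_k = -grad f(x_k), constant step t."""
    x = np.asarray(x0, dtype=float)
    for k in range(k_max):
        g = grad_f(x)
        if np.linalg.norm(g) < eps3:            # ||grad f(x_k)|| < eps3
            break
        d = -g                                   # descent direction
        x_new = x + t * d                        # x_{k+1} = x_k + t_k d_k
        if f(x) - f(x_new) < eps1:               # f(x_k) - f(x_{k+1}) < eps1
            break
        if np.linalg.norm(x - x_new) < eps2:     # ||x_k - x_{k+1}|| < eps2
            break
        x = x_new
    return x

# Illustrative test: f(x) = ||x||^2 / 2, minimized at the origin.
x_min = descent(lambda x: 0.5 * x @ x, lambda x: x, x0=[3.0, -2.0])
print(x_min)  # close to [0, 0]
```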

Gradient-based algorithms

Simply choose $d_{k} := -\nabla f(x_{k})$

If the step is chosen by exact line search,

$t_{k} = \text{arg min}_{t>0} f(x_{k} + td_{k}),$

then $\nabla f(x_{k})$ and $\nabla f(x_{k+1})$ are orthogonal (at the exact minimizer, $\phi'(t_{k}) = d_{k}^{T}\nabla f(x_{k+1}) = 0$ with $d_{k} = -\nabla f(x_{k})$), which leads to zig-zagging and slow (linear) convergence.
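This can be observed on a small made-up quadratic $f(x) = \frac{1}{2} x^{T} A x$, for which the exact step along $d_{k} = -\nabla f(x_{k})$ has the closed form $t_{k} = \frac{g_{k}^{T} g_{k}}{g_{k}^{T} A g_{k}}$ (a sketch, with an illustrative ill-conditioned $A$):

```python
import numpy as np

# Made-up ill-conditioned quadratic f(x) = 0.5 * x^T A x (not from the notes).
A = np.diag([1.0, 10.0])
x = np.array([10.0, 1.0])

for k in range(5):
    g = A @ x                      # gradient of the quadratic
    t = (g @ g) / (g @ (A @ g))    # exact line-search step
    x_new = x - t * g
    g_new = A @ x_new
    # Successive gradients are orthogonal: g_k^T g_{k+1} = 0 (up to round-off).
    print(k, g @ g_new)
    x = x_new
```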

Typically, for the choice of descent direction we impose that $d_{k}^{T} \nabla f(x_{k}) < 0$.

Line minimization

How to compute the step $t_{k}$, i.e. how to (approximately) minimize $\phi(t) = f(x_{k} + td_{k})$ over $t > 0$:

  • Steepest descent: exact minimization.
  • Constant step $t_{k} = t$.
    • Risk of small steps.
    • Does not guarantee that $f(x_{k} + td_{k})<f(x_{k}) \forall k$.
    • Readjust $t = t/2$.
  • Dichotomy, golden section (a sketch of golden-section search follows this list).
  • Quadratic approximation of $\phi: t \mapsto f(x_{k} + td_{k})$.
  • Quadratic or cubic approximation of $\phi$ from the knowledge of $\phi'(0)$.
  • Armijo and Wolfe’s rules: ensure that $t$ is “acceptable” (sufficient decrease of $f$, sufficiently large step $t$).
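A minimal golden-section search for the 1D problem $\min_{t \in [a, b]} \phi(t)$, assuming $\phi$ is unimodal on the bracket $[a, b]$ (the bracket and tolerance below are illustrative choices):

```python
import math

def golden_section(phi, a, b, tol=1e-6):
    """Golden-section search: shrink [a, b] while reusing one interior evaluation."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0      # 1 / golden ratio ~ 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc < fd:                              # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = phi(c)
        else:                                    # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)

# Illustrative use: minimize phi(t) = (t - 2)^2 on [0, 5].
print(golden_section(lambda t: (t - 2.0) ** 2, 0.0, 5.0))  # ~2.0
```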

Step selection: Armijo’s rule

Choose $t$ such that $f(x_{k} + td_{k}) \leq f(x_{k}) + c_{1}\, t\, f'(x_{k}; d_{k})$, with $0 < c_{1} < 1$.

Let $\phi(t) := f(x_{k} + td_{k})$; then the rule rewrites as:

$\phi(t) \leq \phi(0) + c_{1}\, t\, \phi'(0)$

with $\phi'(0) = \nabla f(x_{k})^{T} d_{k}$.
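A minimal backtracking sketch of Armijo's rule (the constant $c_{1} = 10^{-4}$, the initial step and the halving factor are common illustrative choices, not values from the notes):

```python
import numpy as np

def armijo_backtracking(f, grad_f, x, d, t0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Shrink t until f(x + t*d) <= f(x) + c1 * t * grad_f(x)^T d."""
    fx = f(x)
    slope = grad_f(x) @ d          # phi'(0) = grad f(x)^T d, must be < 0 for a descent direction
    t = t0
    for _ in range(max_iter):
        if f(x + t * d) <= fx + c1 * t * slope:
            return t
        t *= shrink                # readjust t <- t/2
    return t

# Illustrative use on f(x) = ||x||^2 / 2 with the steepest-descent direction.
x = np.array([3.0, -2.0])
t = armijo_backtracking(lambda v: 0.5 * v @ v, lambda v: v, x, d=-x)
print(t)  # accepted step size
```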

Step selection: Wolfe’s rule

Choose $t$ such that:

  • $\phi(t) \leq \phi(0) + c_{1}\, t\, \phi'(0)$ (sufficient decrease, as in Armijo's rule)
  • $\phi'(t) \geq c_{2}\, \phi'(0)$ (curvature condition: the step is not too small)

with $0 < c_{1} < c_{2} < 1$.
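A small helper that tests whether a candidate step satisfies both conditions (a sketch; the constants $c_{1} = 10^{-4}$ and $c_{2} = 0.9$ are common illustrative choices):

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, d, t, c1=1e-4, c2=0.9):
    """Check the sufficient-decrease and curvature conditions for step t."""
    slope0 = grad_f(x) @ d                       # phi'(0)
    armijo = f(x + t * d) <= f(x) + c1 * t * slope0
    curvature = grad_f(x + t * d) @ d >= c2 * slope0
    return armijo and curvature

# Illustrative check on f(x) = ||x||^2 / 2, steepest-descent direction, t = 1.
x = np.array([3.0, -2.0])
print(satisfies_wolfe(lambda v: 0.5 * v @ v, lambda v: v, x, d=-x, t=1.0))  # True
```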

Convergence of Gradient Descent

Gradient descent converges under very mild assumptions. For example, it is possible to prove a convergence rate with only a smoothness assumption (not even convexity). We will assume $f$ is differentiable, bounded below, and that its gradient is $L$-Lipschitz.

Then, for the constant step $t_{k} = 1/L$:

$\min_{0 \leq i \leq k} \mid\mid \nabla f(x_{i}) \mid\mid^{2} \leq \frac{2L\,(f(x_{0}) - \inf f)}{k+1}$

so the smallest gradient norm encountered decays like $O(1/\sqrt{k})$.
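A short sketch of the standard argument behind this rate, filled in here from the $L$-smoothness assumption (the notes do not spell it out): the descent lemma applied to the step $x_{k+1} = x_{k} - \frac{1}{L}\nabla f(x_{k})$ gives

$$f(x_{k+1}) \leq f(x_{k}) + \nabla f(x_{k})^{T}(x_{k+1} - x_{k}) + \frac{L}{2}\mid\mid x_{k+1} - x_{k}\mid\mid^{2} = f(x_{k}) - \frac{1}{2L}\mid\mid \nabla f(x_{k})\mid\mid^{2}.$$

Summing from $i = 0$ to $k$ and telescoping,

$$\frac{1}{2L}\sum_{i=0}^{k} \mid\mid \nabla f(x_{i})\mid\mid^{2} \leq f(x_{0}) - f(x_{k+1}) \leq f(x_{0}) - \inf f,$$

and bounding the sum from below by $(k+1)\min_{0 \leq i \leq k}\mid\mid \nabla f(x_{i})\mid\mid^{2}$ yields the stated bound.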

Conjugate gradient algorithm

The conjugate gradient algorithm:

  • Choose $x_{0}$
  • For $k \geq 0$, do $x_{k+1} \leftarrow x_{k} - t_{k} \nabla f(x_{k}) + \beta_{k}(x_{k} - x_{k-1})$ (with $\beta_{0} = 0$, so the first step is a plain gradient step)
  • The convergence rate is much faster than that of plain gradient descent (a sketch is given at the end of this section).

For $k \geq 1$, classical choices of $\beta_{k}$ are:

  • Fletcher-Reeves: $\beta_{k}^{FR} = \frac{\mid\mid \nabla f(x_{k}) \mid\mid^{2}}{\mid\mid \nabla f(x_{k-1}) \mid\mid^{2}}$

  • Polak-Ribière: $\beta_{k}^{PR} = \frac{[\nabla f(x_{k})]^{T}(\nabla f(x_{k}) - \nabla f(x_{k-1}))}{\mid\mid \nabla f(x_{k-1}) \mid\mid^{2}}$

Subspace optimization (memory gradient).
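A minimal nonlinear conjugate-gradient sketch with the Fletcher-Reeves update, written in the equivalent direction form $d_{k} = -\nabla f(x_{k}) + \beta_{k} d_{k-1}$ (the quadratic test problem and the exact line-search formula below are illustrative choices valid only for quadratics; in general $t_{k}$ would come from a line search such as Wolfe's rule):

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b

x = np.zeros(2)
g = grad(x)
d = -g                                   # first direction: steepest descent
for k in range(2):                       # CG solves an n x n SPD quadratic in n steps
    t = -(g @ d) / (d @ (A @ d))         # exact line search along d (quadratic case)
    x = x + t * d
    g_new = grad(x)
    beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves beta_k
    d = -g_new + beta * d                # conjugate direction update
    g = g_new

print(x, np.linalg.solve(A, b))          # the two should coincide
```

On an $n \times n$ symmetric positive definite quadratic this recovers linear conjugate gradient, which terminates in at most $n$ steps (up to round-off).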

