Multi-Layer Perceptron
This post covers the history of Deep Learning, from the Perceptron to the Multi-Layer Perceptron Network.
1. Perceptron
Key idea (1958, Frank Rosenblatt)
- Data are represented as vectors
- Collect training data: some are positive examples, some are negative examples
- Training: find $a$ and $b$ so that
- $a^\top x + b$ is positive for positive samples $x$
- $a^\top x + b$ is negative for negative samples $x$
- Testing: the perceptron can now classify new examples.
Notes:
- This is not always possible to satisfy for all the samples
- Intuitively, it is equivalent to finding a separating hyperplane
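As a concrete illustration, here is a minimal NumPy sketch of the decision rule $a^\top x + b$; the function name `perceptron_predict` and the example values of $a$ and $b$ are illustrative, not part of the original formulation:

```python
import numpy as np

def perceptron_predict(a, b, x):
    """Classify x as +1 or -1 according to the sign of a^T x + b."""
    return 1 if a @ x + b > 0 else -1

# Illustrative 2-D example: a and b define a separating line.
a = np.array([1.0, -1.0])
b = 0.5
print(perceptron_predict(a, b, np.array([2.0, 0.0])))  # +1: positive side
print(perceptron_predict(a, b, np.array([0.0, 2.0])))  # -1: negative side
```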
Training the perceptron: at the time, an ad hoc algorithm was used:
- Start from a random initialization
- For each training sample $x$:
- compare the value of $a^\top x + b$ with its expected sign
- adapt $a$ and $b$ to get a better value for $a^\top x + b$
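Below is a minimal sketch of such an update, written as the classical perceptron learning rule (add or subtract a misclassified sample); the toy data and hyperparameters are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Classical perceptron rule: whenever sign(a^T x + b) != y,
    nudge (a, b) toward the misclassified sample."""
    a = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):          # target is +1 or -1
            if target * (a @ x + b) <= 0:    # wrong side of the hyperplane
                a += target * x              # move the hyperplane toward x
                b += target
                errors += 1
        if errors == 0:                      # converged on separable data
            break
    return a, b

# Linearly separable toy data: positive if x1 + x2 > 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
y = np.array([-1, 1, 1, 1])
a, b = train_perceptron(X, y)
print(np.sign(X @ a + b))  # matches y on this separable set
```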
Note: The perceptron is loosely inspired by the biological neuron.
Limitations: 1969, Perceptrons book, Minsky and Papert
- A perceptron can only classify data points that are linearly separable:
- It fails on easy cases such as the XOR function: no line can separate $\{(0,0), (1,1)\}$ from $\{(0,1), (1,0)\}$
Consequence: it was seen by many as a justification to stop research on perceptrons and contributed to the “AI winter” of the 1970s.
2. Multilayer Perceptron/Neural Network
Key idea (1986, Rumelhart, Hinton, Williams): “chain” several perceptrons together at different depths, with the help of a “squashing” function.
Formalism:
- Input layer: $x$ is the same vector used for the perceptron.
- Hidden layer: consists of perceptrons + squashing function
- Perceptrons weight their input: $Wx + b$, where $W \in \mathbb{R}^{\text{perceptrons} \times \text{features}}$
- Squashing / activation function $h = g(Wx+b)$ “rescales” the input for the next layer.
- Output: $y = W_{2}h+b_{2}$
Note: We can construct networks with an arbitrary number of hidden layers, hence the usefulness of the activation function: without it, a stack of affine layers would collapse into a single affine map.
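A minimal NumPy sketch of the forward pass just described, assuming $g(a) = \max(a, 0)$ (ReLU) as the activation and randomly chosen weights for illustration:

```python
import numpy as np

def relu(a):
    """Squashing / activation function g(a) = max(a, 0)."""
    return np.maximum(a, 0.0)

def mlp_forward(x, W, b, W2, b2):
    """One hidden layer: h = g(Wx + b), then y = W2 h + b2."""
    h = relu(W @ x + b)     # hidden layer: perceptrons + activation
    return W2 @ h + b2      # output layer: plain affine map

# Shapes: W is (perceptrons x features), W2 maps hidden units to outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=3)                                # 3 input features
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)    # 4 hidden perceptrons
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)  # 1 output
print(mlp_forward(x, W, b, W2, b2))
```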
We can find $W$, $b$, $W_{2}$, and $b_{2}$ with the objective:

$$\min_{W, b, W_{2}, b_{2}} \; \sum_{(x, d) \in \mathcal{T}} \left\| W_{2}\, g(Wx + b) + b_{2} - d \right\|^{2}$$

where $\mathcal{T}$ is the training set (features $x$ and expected output $d$).
Note: A Multilayer Perceptron solves non-linearly separable problems such as the XOR function, for example with $g(a) = \max(a, 0)$ (ReLU).
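For instance, the hand-picked weights below (an illustration, not a unique solution) make a two-layer network with $g(a) = \max(a, 0)$ reproduce XOR exactly:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

# Hand-picked weights (illustrative): y = g(x1 + x2) - 2 g(x1 + x2 - 1)
W  = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b  = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W @ np.array(x, dtype=float) + b)
    y = W2 @ h + b2
    print(x, "->", y)   # 0, 1, 1, 0: exactly XOR
```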
In practice:
- Do NOT use a least-squares loss function for classification problems (a cross-entropy loss is the usual choice)!
- There is no closed-form solution, so we use gradient descent; a minimal sketch is given below.
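Here is a minimal sketch of gradient descent on the least-squares objective above, for a regression problem where that loss is appropriate. The gradients are written out by hand via the chain rule; the architecture, target function, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def train_mlp(X, D, hidden=8, lr=0.01, epochs=2000, seed=0):
    """Stochastic gradient descent on sum ||W2 g(Wx + b) + b2 - d||^2
    with g = ReLU (the constant factor 2 is absorbed into the learning rate)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], D.shape[1]
    W  = rng.normal(scale=0.5, size=(hidden, n_in))
    b  = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(n_out, hidden))
    b2 = np.zeros(n_out)
    for _ in range(epochs):
        for x, d in zip(X, D):                    # one sample at a time (SGD)
            z = W @ x + b                         # pre-activation
            h = relu(z)                           # hidden layer
            y = W2 @ h + b2                       # output
            e = y - d                             # residual
            grad_W2 = np.outer(e, h)              # dL/dW2
            grad_b2 = e                           # dL/db2
            dh = W2.T @ e                         # backpropagate into h
            dz = dh * (z > 0)                     # ReLU derivative
            grad_W = np.outer(dz, x)              # dL/dW
            grad_b = dz                           # dL/db
            W2 -= lr * grad_W2; b2 -= lr * grad_b2
            W  -= lr * grad_W;  b  -= lr * grad_b
    return W, b, W2, b2

# Illustrative regression target: d = x1 * x2 on a small grid.
X = np.array([[x1, x2] for x1 in np.linspace(-1, 1, 5) for x2 in np.linspace(-1, 1, 5)])
D = (X[:, 0] * X[:, 1]).reshape(-1, 1)
W, b, W2, b2 = train_mlp(X, D)
pred = np.array([W2 @ relu(W @ x + b) + b2 for x in X])
print(float(np.mean((pred - D) ** 2)))  # mean squared error on the training set
```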
3. The Topology of the Functions Learned by Feedforward Networks
Universal Approximation Theorem (Hornik et al., 1989; Cybenko, 1989)
Any continuous function on a compact set can be approximated arbitrarily well by a two-layer network with enough hidden units; this holds in particular with the ReLU activation.
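The flavor of the theorem (not its proof) can be illustrated by explicitly building a one-hidden-layer ReLU network that interpolates a continuous function on a grid of knots. The construction below is a standard piecewise-linear interpolation trick, not the construction used in the cited papers:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def fit_relu_interpolant(f, knots):
    """Build one-hidden-layer weights so that W2 relu(w*x + b) + b2
    equals the piecewise-linear interpolant of f at the given knots."""
    t = np.asarray(knots, dtype=float)
    slopes = np.diff(f(t)) / np.diff(t)                    # slope on each interval
    W2 = np.concatenate(([slopes[0]], np.diff(slopes)))    # slope changes at knots
    w = np.ones_like(W2)                                   # hidden weights are all 1
    b = -t[:-1]                                            # one unit per knot
    b2 = f(t[0])
    return w, b, W2, b2

# Approximate sin on [0, 2*pi]; more knots (hidden units) -> smaller error.
knots = np.linspace(0.0, 2 * np.pi, 20)
w, b, W2, b2 = fit_relu_interpolant(np.sin, knots)
xs = np.linspace(0.0, 2 * np.pi, 1000)
net = W2 @ relu(np.outer(w, xs) + b[:, None]) + b2
print(float(np.max(np.abs(net - np.sin(xs)))))  # max error; shrinks with more knots
```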
The function learned by a Deep Neural Network with the ReLU operator is:
- Piecewise affine
- Continuous
- The equations defining the final affine regions are correlated with one another in a complex way
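A small sketch, under the assumption of a one-hidden-layer ReLU network on 1-D inputs, checking numerically that the learned function is affine between the kinks contributed by the hidden units (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 5
w, b = rng.normal(size=n_hidden), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=n_hidden), rng.normal()

def net(x):
    """One-hidden-layer ReLU network evaluated at scalar or array x."""
    return W2 @ np.maximum(np.outer(w, np.atleast_1d(x)) + b[:, None], 0.0) + b2

# Each hidden unit switches on/off at x = -b_i / w_i: these are the only
# possible kinks, so the network must be affine between consecutive kinks.
kinks = np.sort(-b / w)
for left, right in zip(kinks[:-1], kinks[1:]):
    xs = np.linspace(left, right, 5)[1:-1]      # points strictly inside the piece
    ys = net(xs)
    slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
    # All interior points lie on the same line: the piece is affine.
    assert np.allclose(ys, ys[0] + slope * (xs - xs[0]))
print("affine on every interval between consecutive kinks; continuous everywhere")
```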