Basics of Deep Learning

Activation functions

| | Logistic sigmoid | Hyperbolic tangent | ReLU |
| --- | --- | --- | --- |
| Definition | $g(z) = \frac{1}{1 + \exp(-z)}$ | $g(z) = \tanh(z)$ | $g(z) = \max(0, z)$ |
| Saturation | Saturation | Saturation | No saturation |
| Usefulness | First activation function | First activation function | Works well in practice |
| At 0 | | $\approx \mathrm{id}$ | Non-differentiable |
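A minimal NumPy sketch of these three activations (the function names and sample values are mine, chosen only for illustration):

import numpy as np

def sigmoid(z):
    # Logistic sigmoid: squashes z into (0, 1); saturates for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Hyperbolic tangent: squashes z into (-1, 1); roughly the identity near 0
    return np.tanh(z)

def relu(z):
    # Rectified linear unit: no saturation for z > 0, non-differentiable at 0
    return np.maximum(0.0, z)

z = np.linspace(-5, 5, 11)
print(sigmoid(z), tanh(z), relu(z))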

Generalizations of ReLU sometimes perform better (see the Keras sketch after this list):

  • Absolute value rectification: $g(z) = \mid z \mid$ useful for features invariant to polarity change.
  • Leaky ReLU: $g(z) = \text{max}(0, z) + \alpha \text{ min}(0, z)$ with $\alpha \approx 0.001$ fixed
  • Parametric ReLU (PReLU): $g(z) = \text{max}(0, z) + \alpha \text{ min}(0, z)$ where $\alpha$ is learned
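A minimal sketch of Leaky ReLU and PReLU in Keras (assuming Keras 2, where `LeakyReLU` and `PReLU` are standalone layers; the layer sizes and input dimension are arbitrary):

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU, PReLU

model = Sequential()
model.add(Dense(64, input_dim=20))   # no built-in activation
model.add(LeakyReLU(alpha=0.001))    # Leaky ReLU with a fixed, small alpha
model.add(Dense(64))
model.add(PReLU())                   # PReLU: alpha is learned during training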

Loss Function

Negative log-likelihood

Key idea: maximize the likelihood, i.e. minimize the negative log-likelihood:

$$J(\theta) = - \mathbb{E}_{(x, y) \sim p_{data}} \log p_{model}(y \mid x)$$

where:

  • $x$: network input
  • $y$: $x$’s label / expected value for $x$
  • $p_{data}$: the distribution of $(x, y)$ over the training set
  • $p_{model}(y \mid x)$: how we compute the probability of a value $y$ from the network output for $x$

Note: Negative log-likelihood = cross-entropy between the training data and the model distributions.
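A toy NumPy sketch of this quantity (the probabilities and labels are made up for illustration): the empirical negative log-likelihood is the average of $-\log p_{model}(y \mid x)$ over the training pairs, i.e. the cross-entropy between the empirical and model distributions.

import numpy as np

# made-up model probabilities p_model(y = i | x) for 4 samples, 3 classes
p_model = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4],
                    [0.2, 0.5, 0.3]])
y = np.array([0, 1, 2, 1])  # true labels for the 4 samples

# negative log-likelihood = cross-entropy with the empirical distribution
nll = -np.mean(np.log(p_model[np.arange(len(y)), y]))
print(nll)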

Mean Squared Error

If we choose $p_{model}(y\mid x) = \mathcal{N}(y; f(x, \theta), I)$, the negative log-likelihood becomes:

$$J(\theta) = \frac{1}{2} \mathbb{E}_{(x, y) \sim p_{data}} \left\| y - f(x, \theta) \right\|^{2} + \text{const}$$
Keras: Final layer should be linear:

model.add(Dense(n)) # no activation function
model.compile(loss = 'mean_squared_error', optimizer = ..)
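As a sanity check (a sketch with made-up values, not from the original notes): with identity covariance, the Gaussian negative log-likelihood differs from half the squared error only by a constant that does not depend on $\theta$.

import numpy as np

y = np.array([1.0, -2.0, 0.5])    # made-up target
f_x = np.array([0.8, -1.5, 0.0])  # made-up network output f(x, theta)
d = len(y)

# Gaussian NLL with identity covariance: 0.5 * ||y - f(x)||^2 + 0.5 * d * log(2*pi)
nll = 0.5 * np.sum((y - f_x) ** 2) + 0.5 * d * np.log(2 * np.pi)
half_sq_err = 0.5 * np.sum((y - f_x) ** 2)

print(nll - half_sq_err)          # constant 0.5 * d * log(2*pi), independent of theta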

Categorical cross-entropy

Context (multi-class classification): $y$ can take integer values in $[0, n[$

Goal: Find $p$ with $p_{i} = p_{model}(y = i \mid x)$ such that $p_{i} \in [0, 1]$ and $\sum p_{i} = 1$

Softmax is a soft binarization of "the maximum value returns 1, the other values return 0". Namely, if $z$ is the vector of raw network outputs:

$$p_{i} = \text{softmax}(z)_{i} = \frac{\exp(z_{i})}{\sum_{j} \exp(z_{j})}$$

The loss function thus becomes:

$$J(\theta) = - \sum_{i} \mathbb{1}[y = i] \log p_{i} = - \log p_{y}$$

Intuition:

  1. Squashes a vector of size $n$ between $0$ and $1$.
  2. Normalization $\Rightarrow$ the whole vector sums to $1$.
  3. The outputs of the softmax are the probabilities that the sample belongs to each class (illustrated in the sketch below).
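A minimal NumPy sketch of these three properties (the logits are made up for illustration):

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; does not change the result
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, -1.0])  # made-up network outputs (logits)
p = softmax(z)

print(p)           # each p_i lies in [0, 1]
print(p.sum())     # the vector sums to 1
print(p.argmax())  # the largest logit gets the largest probability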

Keras:

from keras.utils import np_utils
Y_train = np_utils.to_categorical(y_train, nb_classes) # one-hot encode the integer labels
model.add(Dense(n, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = ..)
model.fit(X_train, Y_train, .. )

Generalization

We want to perform well on new, previously unseen inputs (generalization). Therefore, the quantity we truly want to minimize is the test error. More details about how to handle this can be found here.

