Introduction to Machine Learning

9 minute de lecture

Mis à jour : October 01, 2018

In a nutshell, Machine Learning is a sub-field of Artificial Intelligence whose aim is to convert experience into expertise of knowledge. Born in the 1960s, it quickly grew as a separated field because of a focus shift from decisional AI (logical and knowledge-based approach); its aim is not General Artificial Intelligence but rather tackle solvable problems of a practical nature. And ever since the 1990s, Machine Learning is progressing and flourishing rapidly. This success can largely be attributed to two factors:

Moore’s law and the advent of the Internet; making it possible to analyze vast quantities of data rapidly.
The strength of established field such as data mining, optimization and statistics.

The term in itself was first coined in 1959 by Arthur Lee and Tom Mitchell, who define Machine Learning as:

“A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.”

Introduction
- Why Machine Learning?
- What is Machine Learning?
Taxonomy of of learning
- Unsupervised learning
- Supervised learning
Machine Learning Pipeline
Optimization
- Introduction
- Context and terminology
Expected Risk Minimization
ERM with inductive bias
- The realizability assumption
- The i.i.d. assumption
Probability of getting a wrong sample

1. Why Machine Learning and what is it?

1.1. Why Machine Learning?

We need Machine Learning because there are tasks that change with time (adaptivity) that are too complex to program like tasks performed by humans/animals and tasks beyond human abilities:

Recognizing patterns: Speech Recognition, facial identity, etc
Recommender Systems: Noisy data, commercial pay-off (e.g., Amazon, Netflix)
Information retrieval: Find documents or images with similar content
Computer vision: detection, segmentation, depth estimation, optical flow, etc.
Robotics: perception, planning, autonomous driving etc
Play games: beat humans at complicated games (Go)

Specifically, the use of learning algorithms are relevant in areas where:

Human expertise does not exist (bioinformatics, astrophysics)
Humans are unable to explain their expertise (speech recognition, computer vision)
Complex solutions change in time (routing computer networks)

1.2. What is Machine Learning?

Learning systems are not directly programmed to solve a problem, instead develop their own program based on how they should behave / trial-and-error. Specifically, such a program differs from the informatic paradigm:

	Classical program	ML program
Input	Data + Rules	Answers
Output	Data + Answers	Rules

2. Taxonomy of of learning

Machine Learning algorithms differ among each other depending on several factors; mainly related to the input of the program:

Framework	Role	Teacher	Protocol
Supervised	Active	Helpful	Online
Unsupervised	Passive	Adversarial	Batch learning

The most important factor - the framework - is adressed in detail in this section.

2.1. Unsupervised learning

The main task of an unsupervised algorithm is to create a new representation $\tilde{X}$ of the data $X$. In practice, this representation is often not a finality but rather a mean to another end. This is the case with PCA for instance, which performs dimensionality reduction for:

Reducind storage space and computational time
Removing redundancies
Weakening the curse of dimensionality
Visualizing (in 2 or 3 dimensions) and interpreting the data

Another popular class of unsupervised algorithms are data clustering algorithm, whose goal is roughly to construct groups of samples sharing similar traits. Among others, it allows to:

Understand general characteristics of the data
Infer some properties of an object based on how it relates to other objects

2.2. Supervised learning

Goal: Make predictions: Data + Labels -> Model -> Predictor

Classification: Make discrete predictions + example
Regression: Make real-valued predictions + example

Definition: (Hypothesis class F:) It is the space of possible decision functions we are considering. It answers the question “What shape do you think the discriminant should take?” and is chosen based on our beliefs about the problem.

Definition: (Loss function) Quantifies how far the decision function is from the truth?

Goal: we want a predictor $f$ with small loss on unseen data (e.g., on a test set).

TODO:

Format goal
Examples of classification vs regression
One sentence to introduce formalism
Readability

3. Machine Learning Pipeline

Most Machine Learning problems are solved by the following pipeline:

Input data
Feature and model selection
Training model on training set (done via optimization)
Computing train + test output

And the goals are:

Use “informative” features
Pick a predictor that is
- expressive
- easy to train
- doesn’t overfit
Train a model to minimize training error (see optimiation)

TODO:

Add preprocessing / normalization
Emphasize on evaluation
Add one TLDR

4. Optimization

4.1. Introduction

TODO:

Readability
Give examples for overfitting

Most Machine Learning algorithms - if all - learn by minimizing some error.

A fundamental problem is that the true error is unavailable to the learner. Because the input of the algorithm is a sample $S$ and not the distribution $\mathcal{D}$ that generated the data.

This can have dramatic effects on the implementation of sound Machine Learning pipelines. The main one is overfitting, in which the learner becomes extremly good on the training set (it kind of memorizes it) but fails to generalizes to the remaining data.

4.2. Context and terminology

TODO: Proofread

We will mainly focus on supervised learning, in which the error is the difference between our algorithm results and the expected results. Let’s recall that a learning algorihm receives as input a training set $S$ sampled from an unknown distribution $\mathcal{D}$ and labeled by some target function $f$ (what we want to approximate).

5. Expected Risk Minimization

5.1. Basic definitions

TODO:

Proofread
Add TLDR (introduce formalism)

Definition: (Empirical error) For a predictor $h \in \mathbb{H}$, the mean error the classifier over the training sample:

$L_{S}(h) := \frac{\mid \{ i \in [m]:h(x_{i}) \neq y_{i}\} \mid}{m} = \frac{\text{Number of errors among all samples}}{\text{Numbers of samples}}$

where $[m] = {1, …, m }$.

Expected Risk Minimization Find a predictor $h$ that minimizes $L_{S}(h)$.

5.2. Overfitting

Problem: For a predictor $h \in \mathcal{H}$ we can over-learn on the training sample and it can lead to poor generalization.

A common solution is to apply the ERM learning rule over a restricted search space which we called a hypothesis class. Formally, it translates to this: among a set of predictor $\mathcal{H}$, solve:

$ERM_{\mathcal{H}}(S) \in argmin_{h\in \mathcal{H}} L_S(h)$

TODO:

Give example
Readability

5.3. Inductive bias

Definition: (Inductive bias) By restricting the learner to choosing a predictor from H, we bias it toward a particular set of predictors. It turns out that the incorporation of prior knowledge, biasing the learning process, is inevitable for the success of learning algorithms (this is formally stated and proved as the “No-Free-Lunch theorem” in Chapter 5).

Since the choice of such a restriction is determined before the learner sees the training data, it should ideally be based on some prior knowledge about the problem to be learned.

Roughly speaking, the stronger the prior knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is – it is bound, a priori, by the commitment to these assumptions.

Tradeoff of using a hypothesis class:

More restricted hypothesis class better protects us against overfitting
At the same time might cause us a stronger inductive bias.

6. ERM with inductive bias

Big question: In learning theory, over which hypothesis classes ERM learning will not result in overfitting?

Key takeaway: By putting an upperobound on the size of H, we can show that the ERM will not overfit, provided it is based on a sufficiciently large training sample.

Let’s apply $ERM_{\mathcal{H}}$ on our training sample S

$h_{S} \in \text{argmin}_{h\in \mathcal{H}}$

Prior to delving into the result, let’s focus on the assumptions behind the

TODO:

Readability
Add a TLDR

6.1. The realizability assumption

Definition: (Realizability assumption) There exists $h^{\text{}} \in \mathcal{H}$ s.t. $L_{(\mathcal{D}, f)}(h^{\text{}}) = 0$

It basically means that there exists a predictor that have loss $0$ on the distribution (which imply that the loss on the training set is $0$).

6.2. The i.i.d. assumption

Definition: (i.i.d.) The examples in the training set are independently and identically distributed (i.i.d.) according to the distribution $\mathcal{D}$

Intuitively: Training set $S$ is a window through which the learner gets partial information about the distribution D over the world and the labeling function, $f$. The larger the sample gets, the more likely it is to reflect more accurately the distribution and labeling used to generate it.

Definitions:

$\sigma$ be the probability of getting a nonrepresentative sample
$(1 - \sigma)$ the confidence parameter of our prediction.
$\epsilon$ is the accuracy parameter

TODO:

Readability
Add a TLDR

7. Probability of getting a wrong sample

Goal: Find an upper bounding the probability to sample $m$-tuple of in stances that will lead to failure of the learner.

Theorem:

$\mathbb{P}(\{ S_{x} \mid L_{(\mathcal{D}, f)}(h_{S}) > \epsilon \}) \mid \mathcal{H}\mid e^{-\epsilon m}$

Corollary: Let $\mathcal{H}$ be finite hypothesis class. Let $\delta \in (0, 1)$ and $\epsilon > 0$ and let $m$ be an integer that satisfies:

$m \geq \frac{log(\mid\mathcal{h}\mid)/\delta}{\epsilon}$

Then, for any labeling function f and for any distribution $\mathcal{D}$, for which the realizability assumption holds with probability at least $1-\delta$ over hte choice of an i.i.d. sample S of size m, we have that for every ERM hypothesis h_{S}, it holds that:

$L_{(D, f)}(h_{S}) \leq \epsilon$

TODO:

Readability
Add a TLDR

TODO: Add future work

8. Sources

Partager sur

Twitter Facebook Google+ LinkedIn

Louis de Vitry

Introduction to Machine Learning

1. Why Machine Learning and what is it?

1.1. Why Machine Learning?

1.2. What is Machine Learning?

2. Taxonomy of of learning

2.1. Unsupervised learning

2.2. Supervised learning

3. Machine Learning Pipeline

4. Optimization

4.1. Introduction

4.2. Context and terminology

5. Expected Risk Minimization

5.1. Basic definitions

5.2. Overfitting

5.3. Inductive bias

6. ERM with inductive bias

6.1. The realizability assumption

6.2. The i.i.d. assumption

7. Probability of getting a wrong sample

8. Sources

Partager sur

Laisser un commentaire

Vous pourriez aimer

Spark introduction

Hadoop

Link prediction

Link prediction

Spreading through Network