Introduction to Machine Learning
In a nutshell, Machine Learning is a sub-field of Artificial Intelligence whose aim is to convert experience into expertise or knowledge. Born in the 1960s, it quickly grew into a separate field because of a shift away from decisional AI (the logical, knowledge-based approach); its aim is not Artificial General Intelligence but rather to tackle solvable problems of a practical nature. Ever since the 1990s, Machine Learning has been progressing and flourishing rapidly. This success can largely be attributed to two factors:
- Moore’s law and the advent of the Internet, which make it possible to analyze vast quantities of data rapidly.
- The strength of established fields such as data mining, optimization, and statistics.
The term itself was coined in 1959 by Arthur Samuel; Tom Mitchell later gave the now-standard formal definition of Machine Learning:
“A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.”
- Introduction
- Taxonomy of learning
- Machine Learning Pipeline
- Optimization
- Empirical Risk Minimization
- ERM with inductive bias
- Probability of getting a wrong sample
1. Why Machine Learning and what is it?
1.1. Why Machine Learning?
We need Machine Learning for tasks that are too complex to program by hand (tasks performed by humans/animals, or tasks beyond human abilities) and for tasks that change over time and thus require adaptivity:
- Recognizing patterns: speech recognition, facial identification, etc.
- Recommender systems: noisy data, commercial pay-off (e.g., Amazon, Netflix)
- Information retrieval: finding documents or images with similar content
- Computer vision: detection, segmentation, depth estimation, optical flow, etc.
- Robotics: perception, planning, autonomous driving, etc.
- Playing games: beating humans at complicated games (e.g., Go)
Specifically, learning algorithms are relevant in areas where:
- Human expertise does not exist (bioinformatics, astrophysics)
- Humans are unable to explain their expertise (speech recognition, computer vision)
- Complex solutions change in time (routing computer networks)
1.2. What is Machine Learning?
Learning systems are not directly programmed to solve a problem; instead, they develop their own program from examples of how they should behave or through trial and error. Such a program inverts the classical programming paradigm, as the table below shows (a toy sketch follows it):
| Input | Output |
---|---|---|
Classical program | Data + Rules | Answers |
ML program | Data + Answers | Rules |
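To make the contrast concrete, here is a tiny hypothetical sketch in Python (the spam-filtering task, feature, and data are illustrative assumptions, not from these notes): a hand-written rule versus a rule inferred from data plus answers.

```python
# Classical program: the rule is hand-written by the programmer.
def classical_spam_filter(email: str) -> bool:
    return "win money" in email.lower()  # hypothetical rule

# ML program: the rule (here, a threshold on one feature) is
# inferred from data (email lengths) + answers (spam labels).
def learn_threshold(lengths, labels):
    # Brute-force search: keep the candidate threshold with the
    # best accuracy on the training data.
    best_t, best_acc = None, -1.0
    for t in lengths:
        acc = sum((l >= t) == y for l, y in zip(lengths, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

lengths = [120, 300, 80, 450, 60]            # data: email lengths
labels = [False, True, False, True, False]   # answers: is it spam?
t = learn_threshold(lengths, labels)
learned_rule = lambda length: length >= t    # output: a learned rule
```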
2. Taxonomy of learning
Machine Learning algorithms differ from one another along several factors, mainly related to the input of the program. Each column below is an independent dichotomy (the rows are not linked):
Framework | Role | Teacher | Protocol |
---|---|---|---|
Supervised | Active | Helpful | Online |
Unsupervised | Passive | Adversarial | Batch learning |
The most important factor, the framework, is addressed in detail in this section.
2.1. Unsupervised learning
The main task of an unsupervised algorithm is to create a new representation $\tilde{X}$ of the data $X$. In practice, this representation is often not an end in itself but rather a means to another end. This is the case with PCA for instance, which performs dimensionality reduction for (see the sketch after this list):
- Reducing storage space and computational time
- Removing redundancies
- Weakening the curse of dimensionality
- Visualizing (in 2 or 3 dimensions) and interpreting the data
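A minimal PCA sketch via the SVD of the centered data matrix; the synthetic data and the choice of $k$ are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 200 samples, 10 features (synthetic)

# Center the data, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T        # new representation, shape (200, 2)

# Fraction of the variance kept by the first k components.
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(X_reduced.shape, round(float(explained), 3))
```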
Another popular class of unsupervised algorithms is clustering algorithms, whose goal is roughly to construct groups of samples sharing similar traits (a sketch of k-means follows this list). Among other things, clustering allows us to:
- Understand general characteristics of the data
- Infer some properties of an object based on how it relates to other objects
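As an example, here is a minimal sketch of Lloyd's k-means algorithm; the two-blob data and $k = 2$ are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid recomputation until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)   # recovers the two blobs
```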
2.2. Supervised learning
Goal: make predictions. Data + Labels -> Model -> Predictor
- Classification: make discrete predictions (e.g., classify an email as spam or not spam)
- Regression: make real-valued predictions (e.g., predict the price of a house from its features)
Definition: (Hypothesis class $\mathcal{H}$) The space of possible decision functions we are considering. It answers the question “What shape do you think the discriminant should take?” and is chosen based on our beliefs about the problem.
Definition: (Loss function) Quantifies how far the decision function’s predictions are from the truth.
Goal: we want a predictor $f$ with small loss on unseen data (e.g., on a test set).
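To ground these definitions, here is a small sketch of the two standard losses matching the tasks above, the 0-1 loss for classification and the squared loss for regression (the arrays are illustrative):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Classification: fraction of wrong discrete predictions."""
    return np.mean(y_true != y_pred)

def squared_loss(y_true, y_pred):
    """Regression: mean squared distance to the real-valued truth."""
    return np.mean((y_true - y_pred) ** 2)

y = np.array([0, 1, 1, 0])
print(zero_one_loss(y, np.array([0, 1, 0, 0])))    # 0.25
z = np.array([1.0, 2.0, 3.0])
print(squared_loss(z, np.array([1.1, 1.9, 3.2])))  # 0.02
```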
3. Machine Learning Pipeline
Most Machine Learning problems are solved by the following pipeline (a minimal end-to-end sketch in code follows the two lists below):
- Input data
- Feature and model selection
- Training model on training set (done via optimization)
- Evaluating the model on the training and test sets
And the goals are:
- Use “informative” features
- Pick a predictor that is
- expressive
- easy to train
- doesn’t overfit
- Train a model to minimize training error (see the Optimization section)
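A minimal end-to-end sketch of the pipeline, assuming synthetic data, a linear model, and a closed-form least-squares fit; every choice here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Input data (synthetic: linear signal + noise).
X = rng.uniform(-1, 1, (200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# 2. Feature and model selection: add a bias feature, pick a linear model.
X = np.hstack([X, np.ones((200, 1))])
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]

# 3. Training on the training set (optimization: least squares).
w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# 4. Computing train + test error.
mse = lambda a, b: float(np.mean((a - b) ** 2))
print("train error:", mse(X[train] @ w, y[train]))
print("test error: ", mse(X[test] @ w, y[test]))
```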
4. Optimization
4.1. Introduction
Most Machine Learning algorithms, if not all, learn by minimizing some error.
A fundamental problem is that the true error is unavailable to the learner, because the input of the algorithm is a sample $S$ and not the distribution $\mathcal{D}$ that generated the data.
This can have dramatic effects on the implementation of sound Machine Learning pipelines. The main one is overfitting, in which the learner becomes extremely good on the training set (it essentially memorizes it) but fails to generalize to unseen data.
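A quick way to see overfitting numerically: fit polynomials of increasing degree to a small noisy sample and compare train and test errors. The data and degrees below are illustrative; exact numbers depend on the seed, but the high-degree fit typically reaches near-zero train error with a much larger test error:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)                       # small training sample S
y = np.sin(3 * x) + 0.2 * rng.normal(size=20)
x_test = rng.uniform(-1, 1, 200)                 # held-out data
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=200)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)            # fit on S only
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train {train_err:.3f}, test {test_err:.3f}")
```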
4.2. Context and terminology
We will mainly focus on supervised learning, in which the error measures the difference between our algorithm’s predictions and the expected outputs. Let’s recall that a learning algorithm receives as input a training set $S$ sampled from an unknown distribution $\mathcal{D}$ and labeled by some target function $f$ (what we want to approximate).
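As a concrete instance of learning by minimizing an error, here is a minimal gradient descent sketch that minimizes the empirical squared loss of a linear model; the model, loss, and hyperparameters are assumptions for the example:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize the empirical squared loss L_S(w) = mean((Xw - y)^2)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the loss
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + 0.05 * rng.normal(size=100)
print(gradient_descent(X, y))  # close to [1.0, -2.0]
```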
5. Empirical Risk Minimization
5.1. Basic definitions
Definition: (Empirical error) For a predictor $h \in \mathcal{H}$, the empirical error (or empirical risk) is the mean error of the classifier over the training sample:
$$L_{S}(h) := \frac{\left| \{ i \in [m] : h(x_i) \neq y_i \} \right|}{m},$$
where $[m] = \{1, \dots, m\}$.
Empirical Risk Minimization (ERM): find a predictor $h$ that minimizes $L_{S}(h)$.
5.2. Overfitting
Problem: A predictor $h$ can over-learn the training sample, which leads to poor generalization. For instance, a predictor that simply memorizes the training labels achieves $L_{S}(h) = 0$ yet can be arbitrarily wrong on unseen data.
A common solution is to apply the ERM learning rule over a restricted search space, which we call a hypothesis class. Formally: among a set of predictors $\mathcal{H}$, solve
$$\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\operatorname{argmin}} \; L_{S}(h).$$
A small numerical example is sketched below.
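A sketch of $\mathrm{ERM}_{\mathcal{H}}$ for a finite class of threshold classifiers on the real line; the grid of thresholds, the data, and the label noise are illustrative assumptions:

```python
import numpy as np

# Hypothesis class H: threshold classifiers h_t(x) = 1[x >= t],
# restricted to a finite grid of thresholds.
thresholds = np.linspace(-3, 3, 61)

def empirical_risk(t, x, y):
    """0-1 empirical error L_S(h_t) of the threshold classifier h_t."""
    return np.mean((x >= t).astype(int) != y)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = (x >= 0.5).astype(int)                 # target f: threshold at 0.5
y[rng.choice(50, 3, replace=False)] ^= 1   # a little label noise

# ERM_H: pick the hypothesis in H with the smallest training error.
risks = [empirical_risk(t, x, y) for t in thresholds]
t_erm = thresholds[int(np.argmin(risks))]
print(t_erm, min(risks))   # a threshold near 0.5, small but nonzero risk
```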
5.3. Inductive bias
Definition: (Inductive bias) By restricting the learner to choosing a predictor from $\mathcal{H}$, we bias it toward a particular set of predictors. It turns out that the incorporation of prior knowledge, biasing the learning process, is inevitable for the success of learning algorithms (this is formally stated and proved as the “No-Free-Lunch theorem” in Chapter 5 of Understanding Machine Learning).
Since the choice of such a restriction is determined before the learner sees the training data, it should ideally be based on some prior knowledge about the problem to be learned.
Roughly speaking, the stronger the prior knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is – it is bound, a priori, by the commitment to these assumptions.
Tradeoff of using a hypothesis class:
- A more restricted hypothesis class protects us better against overfitting
- At the same time, it imposes a stronger inductive bias
6. ERM with inductive bias
Big question: In learning theory, over which hypothesis classes does ERM learning not result in overfitting?
Key takeaway: By putting an upper bound on the size of $\mathcal{H}$, we can show that ERM will not overfit, provided it is based on a sufficiently large training sample.
Let’s apply $\mathrm{ERM}_{\mathcal{H}}$ to our training sample $S$ and let $h_{S}$ denote a hypothesis it returns. Before delving into the result, let’s focus on the assumptions behind it.
6.1. The realizability assumption
Definition: (Realizability assumption) There exists $h^{\star} \in \mathcal{H}$ such that $L_{(\mathcal{D}, f)}(h^{\star}) = 0$.
It basically means that there exists a predictor in $\mathcal{H}$ that has loss $0$ on the distribution itself (which implies that, with probability $1$, its loss on the training set is also $0$).
6.2. The i.i.d. assumption
Definition: (i.i.d.) The examples in the training set are independently and identically distributed (i.i.d.) according to the distribution $\mathcal{D}$.
Intuitively: the training set $S$ is a window through which the learner gets partial information about the distribution $\mathcal{D}$ over the world and the labeling function $f$. The larger the sample gets, the more likely it is to reflect accurately the distribution and labeling used to generate it.
Definitions:
- $\delta$ is the probability of getting a nonrepresentative sample,
- $(1 - \delta)$ is the confidence parameter of our prediction,
- $\epsilon$ is the accuracy parameter.
7. Probability of getting a wrong sample
Goal: Find an upper bound on the probability of sampling an $m$-tuple of instances that leads the learner to failure, i.e., to an ERM hypothesis whose true error exceeds $\epsilon$.
Theorem: Under the realizability and i.i.d. assumptions, for a finite hypothesis class $\mathcal{H}$,
$$\mathcal{D}^{m}\left( \{ S : L_{(\mathcal{D}, f)}(h_{S}) > \epsilon \} \right) \leq |\mathcal{H}| \, e^{-\epsilon m}.$$
Corollary: Let $\mathcal{H}$ be a finite hypothesis class. Let $\delta \in (0, 1)$ and $\epsilon > 0$, and let $m$ be an integer that satisfies
$$m \geq \frac{\log(|\mathcal{H}| / \delta)}{\epsilon}.$$
Then, for any labeling function $f$ and for any distribution $\mathcal{D}$ for which the realizability assumption holds, with probability at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, every ERM hypothesis $h_{S}$ satisfies
$$L_{(\mathcal{D}, f)}(h_{S}) \leq \epsilon.$$
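In code, the corollary’s sample-size requirement is a one-liner; the values of $|\mathcal{H}|$, $\epsilon$, and $\delta$ below are illustrative:

```python
import numpy as np

def sample_complexity(H_size, epsilon, delta):
    """Smallest integer m with m >= log(|H| / delta) / epsilon."""
    return int(np.ceil(np.log(H_size / delta) / epsilon))

# E.g., 1000 hypotheses, accuracy 0.05, confidence 95%:
print(sample_complexity(1000, epsilon=0.05, delta=0.05))  # 199
```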
8. Sources
- Understanding Machine Learning: From Theory to Algorithms (Chapters 1 and 2)
- The Elements of Statistical Learning (Chapter 1)
- Convex Optimization: Algorithms and Complexity (Sections 1.1, 1.2, 1.3)
- Convex optimization and gradient descent, lecture notes by Nisheeth Vishnoi, EPFL (Sections 1.1, 1.2, 1.3)