Program
COORDINATOR
Michel Riveill, PR Université Côte d'Azur, Polytech I3S
LOCATION
Tutorials at the Valrose Campus
Lectures held remotely by videoconference from the SophiaTech Campus
ABOUT THIS MINOR
- Summary
Broadly speaking, machine learning (ML) is the scientific field that aims to build models and infer knowledge by applying algorithms to data. The process therefore involves the (statistical) analysis of data and the design of models, possibly predictive ones. These tasks are fundamental in modern science in general, and in biology and medicine in particular. This course provides an introduction to ML, reviewing its fundamental principles and methods.
Each lecture will be accompanied by a hands-on practical (in Python), during which datasets of biological and/or medical importance will be processed. This provides a unique opportunity to assess the performance of the various methods studied (running time, stability, sensitivity to noise/outliers, etc.) and to think critically about the quality of models in biology/medicine.
The datasets used during the practicals will cover the main classes of data used in modern biology, at all scales (individual molecules, cells, organs, individuals).
- General introduction
This lecture will introduce the main ingredients of ML: the different classes of problems, the data involved in such processes, the main classes of algorithms, and the learning process.
Topics :
- Data types
- Supervised vs. unsupervised learning
- Taxonomy of algorithms
- Software platforms and languages
Practical/potential applications :
- Data manipulations
- Model complexity and under/over fitting
- The bias-variance trade-off
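Model complexity and the bias-variance trade-off can be illustrated with a small experiment. The sketch below is an illustration only (it uses synthetic noisy samples of a sine curve, not one of the course datasets): polynomial regressions of increasing degree are fitted, and training/test errors are compared.

```python
# Sketch: under-/over-fitting illustrated with polynomial regression
# on synthetic noisy samples of a sine curve (illustrative data only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

train_err, test_err = {}, {}
for degree in (1, 4, 15):  # too simple / about right / too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err[degree] = mean_squared_error(y_tr, model.predict(X_tr))
    test_err[degree] = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: train MSE {train_err[degree]:.3f}, "
          f"test MSE {test_err[degree]:.3f}")
```

Training error keeps decreasing as the degree grows, while test error is typically lowest at an intermediate complexity: the bias-variance trade-off.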
- Regression with the linear model
Regression is the problem of predicting a response value from explanatory variables. This course will cover the basics of the method, including the selection of variables and the design of sparse models.
Topics :
- Linear regression and least squares
- Errors and model adequacy
- Sparse models
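The contrast between plain least squares and a sparse model can be sketched as follows; this is an illustration on synthetic data (standing in for the prostate cancer dataset used in the practical), with the lasso as the sparse estimator:

```python
# Sketch: ordinary least squares vs a sparse (lasso) linear model.
# Synthetic data with 8 predictors, only 3 of which are truly relevant.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
true_coef = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("OLS  :", np.round(ols.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))  # irrelevant coefficients shrink to 0
```

The L1 penalty drives some coefficients exactly to zero, which is what makes the lasso useful for variable selection.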
Practical/potential application :
The prostate cancer dataset (cf. The Elements of Statistical Learning)
- Classification with logistic regression
Logistic regression is a supervised classification algorithm that models the probability that an observation belongs to a given class. The class probability is derived from a linear model of the input variables.
Topics :
- Classification using linear models
- The logistic regression
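A minimal sketch of logistic regression in scikit-learn, on synthetic binary data standing in for the South African heart disease dataset used in the practical:

```python
# Sketch: logistic regression for binary classification (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
p = clf.predict_proba(X_te[:1])[0, 1]  # estimated P(class = 1)
print("test accuracy:", clf.score(X_te, y_te))
print("P(class 1) for the first test point:", round(p, 3))
```

Unlike a hard classifier, the model returns calibrated-looking probabilities via `predict_proba`, which is often what matters in medical applications.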
Practical/potential application :
South African heart disease dataset (cf. The Elements of Statistical Learning)
- Support Vector Machines (SVM)
SVMs are a popular and robust class of models for supervised classification. The main difficulties are dealing with classes that are partially mixed -- e.g. due to noise -- and whose boundaries have a complex geometry.
Topics :
- Linear separability and support vectors
- Soft margin separators
- Kernels and non-linear separation
- Multiclass classification
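Soft margins and kernels can be sketched together on a toy problem; this illustration (synthetic two-moons data, not the sequence data of the practical) compares a linear and an RBF-kernel SVM on classes whose boundary is non-linear:

```python
# Sketch: soft-margin SVMs with linear and RBF kernels on data whose
# class boundary is non-linear (two interleaved half-moons).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

acc = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr)
    acc[kernel] = clf.score(X_te, y_te)
    print(f"{kernel}: accuracy {acc[kernel]:.2f}, "
          f"{len(clf.support_)} support vectors")
```

The parameter `C` controls the softness of the margin; the RBF kernel lets the separator bend around the interleaved classes.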
Practical/potential application :
Classification of protein and DNA sequences (paper: Biological applications of support vector machines)
- Linear Discriminant Analysis
LDA is another supervised classification algorithm; it uses a linear combination of features to define boundaries separating two or more classes. This lecture will introduce LDA and compare it to the so-called naive Bayes classifier.
Topics :
- Naive Bayes classifier
- LDA
Practical/potential application :
Those of lectures 4 and 5
- CART / Decision Trees / Random Forests
Tree-based models partition the data space to exploit local properties of the data, and can be used for both regression and classification. Multiple trees can also be combined to compensate for the arbitrariness of the partitioning induced by a single tree.
Topics :
- Classification And Regression Trees (CART)
- Decision tree based classification
- Tree induction and split rules
- Ensembles of decision trees and random forests
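A single tree and an ensemble of trees can be compared directly on the iris data (which is part of the practical); the sketch below uses scikit-learn with cross-validation:

```python
# Sketch: a single decision tree vs a random forest on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                           X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                    random_state=0),
                             X, y, cv=5).mean()
print(f"decision tree: {tree_acc:.2f}, random forest: {forest_acc:.2f}")
```

The forest averages many trees grown on bootstrap samples with randomized splits, which reduces the variance of a single tree's partitioning.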
Practical/potential application :
The iris dataset + datasets used in lectures 2 and 3
- Clustering (k-means, hclust)
In an unsupervised context, clustering aims at grouping the data into homogeneous groups by minimizing the intra-group variance. This fundamental task is surprisingly challenging for several reasons: the (generally) unknown number of clusters, clusters whose boundaries have a complex geometry, overlapping clusters (due to noise), high-dimensional data, etc. This class will present two main clustering techniques:
Topics :
- k-means and k-means++
- Hierarchical clustering
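The two techniques can be sketched side by side; this illustration uses synthetic, well-separated blobs (not molecular conformations) and assumes the number of clusters is known, which in practice it usually is not:

```python
# Sketch: k-means and hierarchical (agglomerative) clustering on
# well-separated synthetic blobs; k = 3 is assumed known here.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.8, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
hc = AgglomerativeClustering(n_clusters=3).fit(X)
print("k-means ARI:", round(adjusted_rand_score(y_true, km.labels_), 3))
print("hclust  ARI:", round(adjusted_rand_score(y_true, hc.labels_), 3))
```

The adjusted Rand index compares the found clusters to the ground-truth labels, which is only possible here because the data are synthetic.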
Practical/potential applications :
Clustering molecular conformations
- Dimension reduction (PCA, t-SNE)
Dimensionality reduction methods aim at embedding high-dimensional data into a lower-dimensional space while preserving specific properties, such as pairwise distances or the spread of the data. Originating with the celebrated Principal Component Analysis, more recent methods have focused on data lying on non-linear spaces.
Topics :
- Principal Component Analysis
- t-distributed Stochastic Neighbor Embedding (t-SNE)
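Both embeddings can be sketched on the 64-dimensional handwritten-digits data, used here as an illustrative stand-in for single-cell RNA-seq profiles:

```python
# Sketch: PCA and t-SNE 2-D embeddings of the 64-dimensional digits
# data (standing in here for single-cell RNA-seq profiles).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)          # linear projection
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X[:500])
print("PCA  embedding:", X_pca.shape)
print("t-SNE embedding:", X_tsne.shape)
```

PCA gives a linear projection of the whole dataset, while t-SNE (run on a subset here, as it scales poorly) preserves local neighborhoods and typically separates the classes into distinct islands.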
Practical/potential applications :
Cell classification from RNAseq data (cf papers by Dana Pe'er)
- Lecturers
- Rodrigo Cabral Farias (MC Université Côte d'Azur, Polytech, I3S)
- Lionel Fillatre (PR Université Côte d'Azur, Polytech, I3S)
- Michel Riveill (PR Université Côte d'Azur, Polytech, I3S)
- Prerequisites
Python programming :
- Test your skills in Python : auto-eval (unice.fr)
- FUN MOOC :
- Coursera :
English comprehension : level B1 recommended
- Test your English : https://www.cambridgeenglish.org/fr/test-your-english/general-english/
- Pedagogical resources
- Jupyter Notebook : Project Jupyter
- Bibliography
- Introduction to Machine Learning, E. Alpaydin
- The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman
- Machine Learning: A Probabilistic Perspective, K. Murphy
- Learning with Kernels, B. Schölkopf and A. Smola
- Python Data Science Handbook, J. VanderPlas