Artificial Intelligence: Introduction to Machine Learning

COORDINATOR

Michel Riveill, PR Université Côte d'Azur, Polytech I3S

LOCATION

Tutorials at the Valrose Campus
Videoconference lectures remotely held from the SophiaTech Campus

Prerequisites

Python programming (see details below)

ABOUT THIS MINOR

This minor is also open to students from the DS4H and SPECTRUM graduate schools.
Summary

Broadly speaking, machine learning (ML) is the scientific field concerned with building models and inferring knowledge by applying algorithms to data. The process therefore involves the (statistical) analysis of data and the design of models, possibly predictive ones. These tasks are fundamental to modern science in general, and to biology and medicine in particular. This course provides an introduction to ML by reviewing its fundamental principles and methods.

Each lecture will be accompanied by a hands-on practical (in Python), during which datasets of biological and/or medical importance will be processed. Doing so will provide a unique opportunity to assess the performance of the various methods studied (running time, stability, sensitivity to noise/outliers, etc.) and to think critically about the quality of models in biology/medicine.

The datasets used during the practicals will cover the main classes of data used in modern biology, at all scales (individual molecules, cells, organs, individuals).

General introduction

This lecture will introduce the main ingredients of ML, namely the different classes of problems, the data involved in such processes, the main classes of algorithms, and the learning process.

Topics:

  • Data types
  • Supervised vs. unsupervised learning
  • Taxonomy of algorithms
  • Software platforms and languages


Practical/potential applications:

  • Data manipulations
  • Model complexity and under/overfitting
  • The bias-variance trade-off
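
The under/overfitting behaviour can be illustrated with a minimal NumPy sketch; the sinusoidal toy data and the chosen polynomial degrees are illustrative assumptions, not course material:

```python
import numpy as np

# Toy 1-D regression data: y = sin(x) + noise (an illustrative example).
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 3.0, 20))
y_train = np.sin(x_train) + rng.normal(0.0, 0.1, 20)
x_test = np.sort(rng.uniform(0.0, 3.0, 20))
y_test = np.sin(x_test) + rng.normal(0.0, 0.1, 20)

def fit_and_score(degree):
    """Fit a polynomial of the given degree by least squares;
    return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1 underfits (high bias); a very high degree drives the training
# error down but can inflate the test error (high variance).
for degree in (1, 3, 10):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, "
          f"test MSE = {test_mse:.4f}")
```

Because the polynomial models are nested, the training error can only decrease with the degree; the test error is what reveals the bias-variance trade-off.
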
Regression with the linear model

Regression is the problem of predicting a response value from explanatory variables. This course will cover the basics of the method, including variable selection and the design of sparse models.

Topics:

  • Linear regression and least squares
  • Errors and model adequacy
  • Sparse models


Practical/potential application:
The prostate cancer dataset (cf. The Elements of Statistical Learning)
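
A minimal sketch of least squares vs. a sparse (L1-penalised) model, assuming scikit-learn is available; the synthetic data below is a hypothetical stand-in for the prostate cancer dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic stand-in for the prostate data (hypothetical):
# only the first 2 of 6 features actually influence the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=100)

ols = LinearRegression().fit(X, y)   # ordinary least squares
lasso = Lasso(alpha=0.1).fit(X, y)   # L1-penalised least squares

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```

The L1 penalty shrinks the four irrelevant coefficients towards exactly zero, yielding a sparse model, whereas OLS keeps small non-zero values for all six features.
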

Classification with logistic regression

Logistic regression is a supervised classification algorithm used to model the probability that an observation belongs to a given class. To do so, the log-odds of this probability are modelled as a linear function of the features.

Topics:

  • Classification using linear models
  • The logistic regression


Practical/potential application:
The South African heart disease dataset (cf. The Elements of Statistical Learning)
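
A minimal sketch of logistic regression with scikit-learn (assumed available); the two Gaussian clouds below are a hypothetical stand-in for the heart-disease data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two synthetic classes: Gaussian clouds centred at (-1, -1) and (+1, +1).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)

clf = LogisticRegression().fit(X, y)
# The model estimates P(class | x) as the logistic of a linear score.
proba = clf.predict_proba([[2.0, 2.0]])[0, 1]
print(f"P(class 1 | x = (2, 2)) = {proba:.3f}")
print(f"training accuracy = {clf.score(X, y):.2f}")
```

A point deep inside the class-1 cloud receives a probability close to 1, reflecting its large linear score.
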

Support Vector Machines (SVM)

SVMs are a popular and robust class of models for supervised classification. The main difficulties are dealing with classes that are partially mixed (e.g. due to noise) and with class boundaries that have a complex geometry.

Topics:

  • Linear separability and support vectors
  • Soft-margin separators
  • Kernels and non-linear separation
  • Multiclass classification


Practical/potential application:
Classification of protein and DNA sequences (paper: Biological applications of support vector machines)
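
A minimal sketch contrasting a linear and a kernel (RBF) SVM, assuming scikit-learn is available; the two-moons toy dataset stands in for sequence data after feature extraction:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: a classic example of classes whose
# boundary has a non-linear geometry.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(f"linear kernel, training accuracy: {linear.score(X, y):.2f}")
print(f"RBF kernel,    training accuracy: {rbf.score(X, y):.2f}")
print(f"support vectors used by the RBF model: {rbf.n_support_.sum()}")
```

The kernel trick lets the soft-margin separator act in an implicit high-dimensional feature space, which is why the RBF model can follow the curved boundary that defeats the linear one.
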

Linear Discriminant Analysis

LDA is another supervised classification algorithm, which uses a linear combination of features to define the boundaries separating two or more classes. This lecture will introduce LDA and compare it to the so-called Naive Bayes classifier.

Topics:

  • Naive Bayes classifier
  • LDA


Practical/potential application:
Those of lectures 4 and 5
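
The two classifiers can be compared in a few lines, assuming scikit-learn is available; the bundled iris dataset is used here purely for convenience:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Compare LDA with Gaussian Naive Bayes under 5-fold cross-validation.
X, y = load_iris(return_X_y=True)

for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("Naive Bayes", GaussianNB())]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold CV accuracy = {acc:.3f}")
```

Both model class-conditional Gaussians; Naive Bayes additionally assumes the features are independent within each class, while LDA assumes a shared covariance matrix across classes.
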

CART / Decision Tree / Random Forest 

Tree-based models partition the data space to exploit local properties of the data, and can be used for both regression and classification. Multiple trees can also be combined to compensate for the arbitrariness of the partitioning induced by a single tree.


Topics:

  • Classification And Regression Trees (CART)
  • Decision-tree-based classification
  • Tree induction and split rules
  • Ensembles of decision trees and random forests

Practical/potential application:
The iris dataset + datasets used in lectures 2 and 3.
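
A minimal sketch on the iris dataset, assuming scikit-learn is available, comparing a single tree to an ensemble of randomized trees:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single CART tree vs. an ensemble of 100 randomized trees.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

tree_acc = cross_val_score(tree, X, y, cv=5).mean()
forest_acc = cross_val_score(forest, X, y, cv=5).mean()
print(f"single decision tree, mean CV accuracy: {tree_acc:.3f}")
print(f"random forest,        mean CV accuracy: {forest_acc:.3f}")
```

Averaging over many trees, each grown on a bootstrap sample with random feature subsets, smooths out the arbitrary axis-aligned partitions of any single tree.
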

Clustering (k-means, hclust)

In an unsupervised context, clustering aims at grouping the data into homogeneous groups by minimizing the intra-group variance. This fundamental task is surprisingly challenging due to several difficulties: the (generally) unknown number of clusters, clusters whose boundaries have a complex geometry, overlapping clusters (due to noise), high-dimensional data, etc. This class will present two main clustering techniques:

Topics:

  • k-means and k-means++
  • Hierarchical clustering


Practical/potential applications:
Clustering molecular conformations
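
A minimal k-means sketch, assuming scikit-learn is available; the three Gaussian blobs are a hypothetical stand-in for conformations described by two features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated Gaussian blobs of 50 points each.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])

# k-means minimises the within-cluster (intra-group) variance;
# n_init=10 restarts guard against bad random initialisations.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
print(f"within-cluster sum of squares (inertia): {km.inertia_:.1f}")
```

Note that the number of clusters (3) is given to the algorithm here; in practice it is usually unknown, which is one of the difficulties mentioned above.
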

Dimension reduction (PCA, t-SNE)

Dimensionality reduction methods aim at embedding high-dimensional data into a lower-dimensional space while preserving specific properties, such as pairwise distances or the data spread. The field originated with the celebrated Principal Component Analysis (PCA) method; more recent methods focus on data lying on non-linear spaces.

Topics:

  • Principal Component Analysis (PCA)
  • t-distributed Stochastic Neighbor Embedding (t-SNE)


Practical/potential applications:
Cell classification from RNA-seq data (cf. papers by Dana Pe'er)
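
A minimal PCA sketch, assuming scikit-learn is available; the bundled iris dataset is used here in place of RNA-seq data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Embed the 4-dimensional iris measurements into 2 dimensions while
# preserving as much of the data spread (variance) as possible.
X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print("embedded shape:", X2.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The explained variance ratio tells how much of the original spread survives the projection; for genuinely non-linear structure, t-SNE is the usual alternative.
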

Lecturers
Prerequisites

Python programming
English comprehension: level B1 recommended
Pedagogical resources
Bibliography
  • Introduction to Machine Learning, E. Alpaydin
  • The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman
  • Machine Learning: A Probabilistic Perspective, K. Murphy
  • Learning with Kernels, B. Schölkopf and A. Smola
  • Python Data Science Handbook, J. VanderPlas