B32
Course Coordinator: 
Tomoki Fukai
Statistical Modeling
Description: 

Modern technologies in the experimental and computational sciences generate large amounts of high-dimensional data. In contrast to classical hypothesis testing, modern statistical methods use the data to construct a statistical model of the process that generated the dataset. Such a model not only describes the statistical properties of past observations but also enables the prediction of future observations. The course is designed for students who wish to learn the mathematical methods to analyze high-dimensional data for active inference. 

This course is based on the second half of the previous course "B07 Statistical Methods".

By completing this course, students will receive one credit.

This course can be taken immediately after the course B31 Statistical Testing, or as a stand-alone course.

Aim: 
This course provides an introduction to machine learning. Students are required to have some knowledge and skills in mathematics, but the course is not intended for pure theorists. The material is best suited to students with a computational background. Students with an experimental background are also highly welcome if they have a strong interest in modern data analysis.
Course Content: 

1. Introduction
The following points are explained.
Hypothesis testing versus statistical modeling (or machine learning)
Needs and difficulties of high-dimensional data analysis
A true model? Does it exist or not?

2. Regression analysis and maximum likelihood
Regression is an entry point to statistical modeling and remains a practically important topic. Some representative methods, such as linear regression, will be explained together with maximum likelihood estimation. Logistic regression and the receiver operating characteristic (ROC) curve are also discussed.
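
As an illustration of how linear regression and maximum likelihood estimation are connected, the following minimal sketch fits a straight line under the assumption of Gaussian noise, in which case the maximum likelihood estimates of the slope and intercept coincide with the least-squares solution. The function names and the toy data are hypothetical and are not taken from the course materials.

```python
import math


def fit_linear_mle(xs, ys):
    """Maximum likelihood fit of y = a*x + b + noise, assuming Gaussian noise.

    Under this assumption the estimates of a and b are the least-squares
    estimates, and the estimate of the noise variance is RSS / n.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / n  # ML estimate divides by n, not by n - 2
    return a, b, sigma2


def gaussian_log_likelihood(xs, ys, a, b, sigma2):
    """Log-likelihood of the data under y ~ Normal(a*x + b, sigma2)."""
    n = len(xs)
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


# Toy data (made up for illustration), roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

a, b, sigma2 = fit_linear_mle(xs, ys)
best = gaussian_log_likelihood(xs, ys, a, b, sigma2)
# Moving the slope away from the estimate lowers the likelihood.
perturbed = gaussian_log_likelihood(xs, ys, a + 0.1, b, sigma2)
print(a, b, sigma2, best, perturbed)
```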

3. Model selection
In statistical modeling, the same data can be fitted with multiple models having different levels of complexity. However, a complex model may overfit the data. Model selection enables a systematic comparison between the different models. The Akaike information criterion (AIC) will be explained in detail.
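
A minimal sketch of model comparison with the AIC, defined as AIC = 2k - 2 ln L, where k is the number of fitted parameters and L is the maximized likelihood. It compares a constant-mean model with a straight-line model, both under an assumed Gaussian noise model. The data and helper names are hypothetical; the model with the lower AIC is preferred.

```python
import math


def max_gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood given the residuals of a fitted model.

    The noise variance is set to its ML estimate, RSS / n, which gives
    ln L = -(n / 2) * (ln(2 * pi * sigma2) + 1).
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


# Toy data (made up for illustration), roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Model 1: constant mean. Parameters: mean and noise variance (k = 2).
res_const = [y - mean_y for y in ys]
aic_const = aic(max_gaussian_log_likelihood(res_const), 2)

# Model 2: straight line. Parameters: slope, intercept, noise variance (k = 3).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
res_line = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
aic_line = aic(max_gaussian_log_likelihood(res_line), 3)

print("AIC constant:", aic_const)
print("AIC line:    ", aic_line)
```

For these data the straight-line model attains the lower AIC, because its gain in likelihood outweighs the penalty for the extra parameter.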

4. Dimensional reduction
When high-dimensional data are given, we may analyze the structure of the distribution in a low-dimensional projection space. Principal component analysis (PCA) provides a basic yet very useful method for dimensional reduction of datasets. As a related topic, singular value decomposition (SVD) will also be discussed.
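
The idea behind PCA can be sketched with the standard library alone: compute the sample covariance matrix and find its leading eigenvector, which is the first principal component. Note that this sketch uses power iteration rather than SVD, purely to stay dependency-free; in practice a library SVD or eigendecomposition routine would normally be used. The function names and the toy data are hypothetical.

```python
import math


def covariance_matrix(points):
    """Sample covariance matrix (dividing by n - 1) of a list of points."""
    n = len(points)
    d = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def first_principal_component(points, num_iters=200):
    """Leading eigenvector of the covariance matrix via power iteration.

    Assumes the data are not degenerate and that the start vector is not
    orthogonal to the leading eigenvector.
    """
    cov = covariance_matrix(points)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(num_iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance explained along v is the Rayleigh quotient v^T C v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance


# Toy 2-D data (made up for illustration), scattered around the line y = 2x.
points = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.0), (2.0, 3.9)]
pc1, var1 = first_principal_component(points)

# Projecting each point onto pc1 gives a one-dimensional representation.
# The toy data have zero mean, so no centering is needed before projecting.
projected = [sum(a * b for a, b in zip(p, pc1)) for p in points]
print(pc1, var1, projected)
```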

5. Categorization and decision making
Categorization or classification of data points is important in a wide range of real-world decision-making problems. Several methods for categorization, including linear discriminant analysis (LDA), are introduced. I will also explain the widely used framework of generalized linear models (GLMs).

6. Bayesian approach
In contrast to the maximum likelihood estimation of a true model, Bayesian statistical modeling assumes that observations result from a mixture of multiple statistical models. I will explain the basic concepts of Bayesian statistical modeling using simple examples.

7. Clustering
Clustering is required in various feature-detection problems in data science, and several methods are known in machine learning. Here, we first study one of the simplest methods, k-means clustering. Bayesian approaches to clustering problems will also be discussed. If time allows, these will include variational Bayes clustering.
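
A minimal sketch of k-means clustering, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The initialization below simply takes the first k points as centroids, which is an assumption made to keep the example deterministic; practical implementations usually use random or k-means++ initialization. The function name and the toy data are hypothetical.

```python
def kmeans(points, k, num_iters=20):
    """Plain k-means with squared Euclidean distance.

    Returns the final centroids and the cluster label of each point.
    """
    # Naive deterministic initialization: the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(num_iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(coord) / len(members)
                                for coord in zip(*members)]
    return centroids, labels


# Toy 2-D data (made up for illustration): two well-separated groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, -0.1),
          (4.9, 5.0), (-0.1, 0.0), (5.1, 4.8)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```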

Course Type: 
Elective
Credits: 
1
Assessment: 
Weekly exercises and homework (75%), in-term examination (25%).
Textbook: 
Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer Science, 2006.
Prior Knowledge: 

Basic knowledge and skills in linear algebra (matrices, eigenvalues, eigenvectors, etc.), calculus (differentiation and integration of functions), and probability theory (Gaussian and some other distributions) are required. Skills in Python programming are also required for exercises.