Modern technologies in the experimental and computational sciences generate large amounts of high-dimensional data. In contrast to classical hypothesis testing, modern statistical methods use data to construct a statistical model of the process that generated the dataset. Such a model not only describes the statistical properties of past observations but also enables the prediction of future observations. The course is designed for students who wish to learn the mathematical methods for analyzing high-dimensional data for active inference.

This course is based on the second half of the previous course "B07 Statistical Methods".

By completing this course, students will receive one credit.

This course can be taken immediately after the course B31 Statistical Testing, or as a stand-alone course.

1. Introduction

The following points are explained.

Hypothesis testing versus statistical modeling (or machine learning)

Needs and difficulties of high-dimensional data analysis

A true model? Does it exist or not?

2. Regression analysis and maximum likelihood

Regression is an entry point to statistical modeling, yet it remains a practically important topic. Some representative methods such as linear regression will be explained together with maximum likelihood estimation (MLE). Logistic regression and the receiver operating characteristic (ROC) curve are also discussed.
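
As an illustration of how linear regression and maximum likelihood fit together, the following is a minimal sketch, not course material. It assumes a straight-line model with independent Gaussian noise, in which case maximizing the likelihood over the slope and intercept coincides with ordinary least squares. The function names (fit_line_mle, log_likelihood) and the synthetic data are hypothetical choices made for this example.

```python
import math
import random


def fit_line_mle(xs, ys):
    """Maximum likelihood fit of y = a*x + b + Gaussian noise.

    With i.i.d. Gaussian noise, the likelihood is maximized by the
    least-squares slope and intercept, so the closed form is used.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    # The MLE of the noise variance divides by n (not n - 2).
    sigma2 = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    return a, b, sigma2


def log_likelihood(xs, ys, a, b, sigma2):
    """Gaussian log-likelihood of the data under the line y = a*x + b."""
    n = len(xs)
    rss = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - rss / (2.0 * sigma2)


# Synthetic data (assumed for illustration): true slope 2, intercept 1.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in xs]

a_hat, b_hat, sigma2_hat = fit_line_mle(xs, ys)
ll_hat = log_likelihood(xs, ys, a_hat, b_hat, sigma2_hat)
print(f"slope={a_hat:.3f} intercept={b_hat:.3f} log-likelihood={ll_hat:.2f}")
```

Moving any of the three fitted parameters away from its estimate lowers the log-likelihood, which is one way to check the fit numerically.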

3. Model selection

In statistical modeling, the same data can be fitted with multiple models having different levels of complexity. However, an overly complex model may overfit the data. Model selection enables a systematic comparison between the different models. The Akaike information criterion (AIC) will be explained in detail.
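
To make the comparison concrete, here is a small sketch of the AIC, defined as AIC = 2k - 2 ln L, where k is the number of fitted parameters and L is the maximized likelihood. The two candidate models (a constant mean and a straight line, both with Gaussian noise), the helper names, and the synthetic data are assumptions chosen for this example only.

```python
import math
import random


def gaussian_max_loglik(residuals):
    """Maximized Gaussian log-likelihood given a fitted model's residuals."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # MLE of the noise variance
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def aic(max_loglik, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2.0 * n_params - 2.0 * max_loglik


# Synthetic data (assumed for illustration) with a genuine linear trend.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in xs]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Model A: constant mean. Parameters: mean, noise variance (k = 2).
res_const = [y - mean_y for y in ys]

# Model B: straight line. Parameters: slope, intercept, noise variance (k = 3).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
res_line = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

aic_const = aic(gaussian_max_loglik(res_const), 2)
aic_line = aic(gaussian_max_loglik(res_line), 3)
print(f"AIC constant={aic_const:.1f}  AIC line={aic_line:.1f}")
```

Here the extra parameter of the line model is easily justified by the improvement in fit, so its AIC is lower. When the data contain no trend, the penalty term 2k tends to favor the simpler model.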

4. Dimensional reduction

When high-dimensional data are given, we may analyze the structure of the distribution in a low-dimensional projection space. Principal component analysis (PCA) provides a basic yet very useful method for dimensional reduction of datasets. As a related topic, singular value decomposition (SVD) will also be discussed.
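
The link between PCA and SVD can be sketched in a few lines. This is a minimal example that assumes NumPy is available. The function name pca_svd and the synthetic three-dimensional data are hypothetical. The principal axes are the right singular vectors of the centered data matrix, and the squared singular values divided by (n - 1) give the variance along each axis.

```python
import numpy as np


def pca_svd(X, n_components):
    """PCA of the rows of X via SVD of the centered data matrix."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    scores = X_centered @ components.T  # low-dimensional projection
    return components, explained_variance, scores


# Synthetic data (assumed): points scattered along the direction (1, 2, -1).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t, -t]) + 0.05 * rng.normal(size=(200, 3))

components, variances, scores = pca_svd(X, n_components=2)
print("first principal axis:", components[0])
print("explained variances:", variances)
```

The sign of each principal axis is arbitrary, so only its direction up to sign is meaningful.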

5. Categorization and decision making

Categorization or classification of data points is important in a wide range of real-world decision-making problems. Several methods for categorization, including linear discriminant analysis (LDA), are introduced. I will also explain the widely used framework of the generalized linear model (GLM).

6. Bayesian approach

In contrast to the maximum likelihood estimation of a true model, Bayesian statistical modeling assumes that observations result from a mixture of multiple statistical models. I will explain the basic concepts of Bayesian statistical modeling using simple examples.
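
One simple example of the kind the Bayesian approach allows is estimating the bias of a coin. The sketch below is an assumption-laden illustration: it uses a Beta prior with Bernoulli observations (a conjugate pair), and the function names and the observation list are hypothetical. The maximum likelihood estimate commits to a single bias value, whereas the Bayesian prediction averages over all candidate bias values weighted by the posterior.

```python
def beta_bernoulli_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli outcomes."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution.

    For Bernoulli data this is also the posterior predictive probability
    that the next observation is a success.
    """
    return alpha / (alpha + beta)


# Hypothetical coin flips: 1 = heads, 0 = tails.
observations = [1, 1, 1, 0, 1, 1, 0, 1]
successes = sum(observations)
failures = len(observations) - successes

# Maximum likelihood: a single point estimate of the bias.
mle = successes / len(observations)

# Bayesian: start from a uniform prior, Beta(1, 1), and update it.
post_alpha, post_beta = beta_bernoulli_update(1.0, 1.0, successes, failures)
posterior_mean = beta_mean(post_alpha, post_beta)

print(f"MLE={mle:.3f}  posterior=Beta({post_alpha}, {post_beta})  "
      f"predictive P(heads)={posterior_mean:.3f}")
```

With few observations the posterior mean is pulled toward the prior mean of 0.5. As more data arrive, the two estimates converge.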

7. Clustering

Clustering is required in various feature-detection problems in data science, and several methods are known in machine learning. Here, we first study one of the simplest methods, k-means clustering. Bayesian approaches to clustering problems will also be discussed. Such topics will include variational Bayes clustering if time allows.
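
For orientation, k-means can be sketched as Lloyd's algorithm, which alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. The version below is a minimal pure-Python sketch. The function names, the explicit initial centers, and the toy data are assumptions made to keep the example deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's algorithm; returns the final centers and point labels."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # Move the center to the coordinate-wise mean of its members.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, assign(points, centers)


# Toy data (assumed): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, initial_centers=[points[0], points[3]])
print("centers:", centers)
print("labels:", labels)
```

The result of k-means depends on the initial centers, so in practice the algorithm is often run from several random initializations.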

Basic knowledge and skills in linear algebra (matrices, eigenvalues, eigenvectors, etc.), calculus (differentiation and integration of functions), and probability theory (Gaussian and some other distributions) are required. Python programming skills are also required for the exercises.