Skip to main content

Thi Thuy Nga Nguyen: Diffusion map - A nonlinear dimensionality reduction method

Masters Thesis

Time: Thu 2020-06-11 10.45

Location: Zoom, meeting ID: 901 405 086

Participating: Thi Thuy Nga Nguyen

Subject area: Mathematical Statistics

Supervisor: Chun-Biu Li


In this report, we discuss the diffusion map, a nonlinear dimensionality reduction method, which focuses on discovering the underlying geometrical structure of data. We first consider a given dataset as a weighted graph, then construct a Markov chain on this graph. The transition matrix P , expressed by a Gaussian kernel function with a single parameter σ-kernel width, characterizes the local connectivity on the graph. When running the Markov chain forward in time, P^t integrates the local connectivities to describe the global connectivity, leading to the idea of diffusion distance viewed as the average length of all paths connecting ttwo nodes in the weighted graph. The diffusion distance defined from P^t thus contains the information of the intrinsic data geometry. We then use eigen-decomposition of P^t to define diffusion coordinates and the diffusion map. This map reorganizes the data according to the diffusion distance in which the points are close if they are highly connected in the weighted graph. Moreover, the diffusion map embeds the data in a lower-dimensional space where the Euclidean distance is an approximation of the diffusion distance. Diffusion map is very robust to noise and easy to implement. Nevertheless, it is not trivial to choose the parameters appropriately: the kernel width σ, the timescale t, and the dimension of the embedding space q. While the kernel width relates to how transition probabilities describe local connectivity, the timescale t affects the ability of the diffusion distances to capture global connectivity, and the dimension q characterizes the intrinsic dimension of the representation we would like to discover. All of them are investigated by toy examples in our report.