Welcome to the following Master's thesis defense:

Student: Susanna Pozzoli

Title: Domain Expertise-Agnostic Feature Selection for the Analysis of Breast Cancer Data
Time: Friday, June 14, 2019 @ 10:00
Place: Ada Room, Electrum, 4th floor, Elevator A.
Examiner: Sarunas Girdzijauskas
Academic Supervisor: Leila Bahri
Industrial Supervisor: Amira El Hosary
Opponent: Tianze Wang
Language: English

Abstract:
High-dimensional data sets are becoming increasingly common. Because of the curse of dimensionality, feature selection has become a widespread problem. Unfortunately, feature selection largely relies on ground truth and domain expertise. Either or both may be unavailable, so there is a growing need for unsupervised feature selection in many fields, such as marketing and proteomics.

Unlike in the past, biologists can now measure the amount of protein in a cancer cell. It is no wonder that the data is high-dimensional: the human body is composed of thousands and thousands of proteins. Intuitively, only a handful of proteins cause the onset of the disease. We may want to cluster the cancer patients, but at the same time we want to find the proteins that produce good partitions.

We propose a methodology designed to find the features that maximize clustering performance. After dividing the proteins into groups, we clustered the patients and then evaluated the clustering performance. We developed two pipelines. The first focuses on the data provided by the laboratory, whereas the second combines that internal data with external data on protein complexes. The threshold on clustering performance was set with the help of the biologists at Karolinska Institutet who contributed to the project.
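
As a rough illustration of the group, cluster, evaluate loop described above, the following minimal Python sketch scores each feature group by how well it partitions the samples and keeps the groups whose score reaches a threshold. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy data, the group names, the use of k-means with a farthest-point initialization, the silhouette coefficient as the performance measure, and the threshold value 0.8. The function names (kmeans, silhouette, select_feature_groups) are hypothetical.

```python
import math


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means better separation."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        def mean_dist(c):
            d = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            return sum(d) / len(d) if d else 0.0
        if labels.count(labels[i]) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(labels[i])  # cohesion: mean distance within own cluster
        b = min(mean_dist(c) for c in clusters if c != labels[i])  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def select_feature_groups(data, groups, k=2, threshold=0.5):
    """Cluster the samples on each feature group and keep well-scoring groups."""
    scores = {}
    for name, cols in groups.items():
        sub = [[row[c] for c in cols] for row in data]
        scores[name] = silhouette(sub, kmeans(sub, k))
    selected = [name for name, s in scores.items() if s >= threshold]
    return selected, scores


# Toy data (made up): 6 samples x 4 features. Columns 0-1 separate the
# samples into two tight groups; columns 2-3 do so much less clearly.
data = [
    [0.1, 0.2, 5.0, 1.0],
    [0.2, 0.1, 1.0, 5.0],
    [0.0, 0.3, 3.0, 3.0],
    [9.8, 9.9, 5.1, 1.2],
    [9.9, 10.1, 0.9, 4.9],
    [10.2, 9.7, 3.1, 2.9],
]
groups = {"informative": [0, 1], "uninformative": [2, 3]}  # hypothetical groups

selected, scores = select_feature_groups(data, groups, k=2, threshold=0.8)
print(selected, scores)
```

In this sketch the threshold plays the role of the performance cut-off mentioned above; in practice its value would come from domain input rather than from the code.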

In the thesis, we show how to select features well without domain expertise in the case of breast cancer data. The experiment shows that, with the aid of feature selection, we can reach a clustering performance up to ten times better than the baseline.

Keywords:
breast cancer, clustering, clustering performance evaluation, feature selection, proteomics, unsupervised learning