The Quest for Robust Model Selection Methods in Linear Regression
Time: Mon 2022-10-31 at 10:00
Subject area: Electrical Engineering and Systems, Mathematical Statistics
Respondent: Prakash Borpatra Gohain, Division of Information Science and Engineering (ISE), KTH Royal Institute of Technology, Sweden
Opponent: Professor K. V. S. Hari, Indian Institute of Science, Bengaluru
Supervisor: Professor Magnus Jansson, Division of Information Science and Engineering
A fundamental requirement in data analysis is fitting the data to a model that can be used for prediction and knowledge discovery. A typical and favored approach is to use a linear model that explains the relationship between the response and the independent variables. Linear models are simple, mathematically tractable, and readily interpretable, which makes them ubiquitous across many fields of application. Nonetheless, finding the best model (or the true model, if one exists) is a challenging task that requires meticulous attention. In this PhD thesis, we consider the problem of model selection (MS) in linear regression, with particular focus on the high-dimensional setting where the parameter dimension is large compared to the number of available observations. Most existing MS methods struggle in two major areas: consistency and scale-invariance. Consistency refers to the ability of an MS method to pick the true model as the sample size grows large and/or as the signal-to-noise ratio (SNR) increases. Scale-invariance means that the performance of the MS method is stable under any rescaling of the data. Both properties are crucial for any MS method. Among MS methods based on information criteria, the Bayesian Information Criterion (BIC) is undoubtedly the most popular and widely used. However, the newer BIC forms, including the extended versions designed for high-SNR scenarios, are not invariant to data scaling, and our results indicate that their performance is quite unstable under different scaling scenarios. To eradicate this problem, we propose improved versions of the BIC criterion, namely BICR and EBICR, where the subscript 'R' stands for robust.
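For background, the classical BIC scores each candidate model by a penalized goodness of fit and picks the minimizer. The sketch below is the textbook RSS-based BIC applied to nested model orders, not the thesis's BICR or EBICR criteria (whose penalty terms are not reproduced here); all data, dimensions, and seeds are invented for illustration. It also shows why the classical form is already scale-invariant: scaling y by a constant shifts every score by the same amount, leaving the ranking unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p_true, p_max = 200, 3, 8  # illustrative sizes, not from the thesis

# True model uses only the first p_true of p_max candidate regressors.
X = rng.standard_normal((N, p_max))
beta = np.zeros(p_max)
beta[:p_true] = [1.5, -2.0, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(N)

def bic(y, X_k):
    """Classical BIC for a linear model: N*log(RSS/N) + k*log(N)."""
    n, k = X_k.shape
    b, *_ = np.linalg.lstsq(X_k, y, rcond=None)
    rss = np.sum((y - X_k @ b) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

# Score nested models of order k = 1..p_max and pick the minimizer.
scores = [bic(y, X[:, :k]) for k in range(1, p_max + 1)]
k_hat = 1 + int(np.argmin(scores))

# Scaling y by c shifts every score by the same constant N*log(c^2),
# so the classical form's selected order does not change.
scores_scaled = [bic(100.0 * y, X[:, :k]) for k in range(1, p_max + 1)]
k_hat_scaled = 1 + int(np.argmin(scores_scaled))
print(k_hat, k_hat_scaled)
```

The instability the thesis addresses arises in the newer high-SNR BIC variants, whose penalty terms do not cancel under scaling the way the RSS term does here.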
BICR is based on the classical order-selection setting, whereas EBICR extends BICR to handle MS in the high-dimensional setting, where the parameter dimension p may also grow with the sample size N. We analyze their performance both as N grows large and as the noise variance diminishes towards zero, and we provide detailed analytical proofs guaranteeing their consistency in both regimes. Simulation results indicate that the proposed MS criteria are robust to any data scaling and offer significant improvement in correctly identifying the true model. Additionally, we generalize EBICR to handle MS in block-sparse high-dimensional general linear regression. Block-sparsity arises in many applications, yet existing information-criterion-based MS methods are not designed to handle the block structure of the linear model. The proposed generalization handles the block structure effortlessly and can be employed for MS in any type of linear regression framework.
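The block-sparse setting can be pictured as selecting which groups of regressors are active, rather than individual columns. A minimal sketch, again using the plain classical BIC rather than the thesis's generalized EBICR, scores every subset of blocks exhaustively (feasible only for a handful of blocks); the block layout, sizes, and seed are invented for illustration.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
N, n_blocks, block_len = 150, 4, 2  # illustrative sizes
p = n_blocks * block_len

X = rng.standard_normal((N, p))
beta = np.zeros(p)
beta[0:2] = [1.0, -1.5]   # block 0 active
beta[4:6] = [2.0, 0.8]    # block 2 active
y = X @ beta + 0.3 * rng.standard_normal(N)

def bic_for_support(y, X, cols):
    """Classical BIC scored on the columns selected by a block support."""
    n = len(y)
    if not cols:
        rss, k = float(np.sum(y ** 2)), 0
    else:
        Xs = X[:, cols]
        b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss, k = float(np.sum((y - Xs @ b) ** 2)), len(cols)
    return n * np.log(rss / n) + k * np.log(n)

# Exhaustively score every subset of blocks and keep the minimizer.
best_support, best_score = None, np.inf
for r in range(n_blocks + 1):
    for S in combinations(range(n_blocks), r):
        cols = [b * block_len + j for b in S for j in range(block_len)]
        score = bic_for_support(y, X, cols)
        if score < best_score:
            best_support, best_score = S, score

print(best_support)
```

Because whole blocks enter or leave the model together, the search space is over block supports, which is what the generalized criterion in the thesis exploits; practical methods would avoid the exhaustive search used here.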