Statistics in Practice

Fast algorithms and modern visualizations for feature selection

This short course focuses on model selection techniques for regression models in two scenarios: when an exhaustive search of the model space is possible, and when the dimension is large and either stepwise algorithms or regularisation techniques have to be employed to identify good models. We incorporate recent research on graphical tools for model choice and touch on how to tune regularisation procedures, such as the Lasso, through resampling or model selection criteria. Importantly, the limitations of the various model selection procedures will be discussed. A key component of the course is assessing the stability of the selected components, which is paramount for reliable predictive final models. We show how this can be achieved by visualising measures of stability.

The practical implementation of the discussed methods is an essential component of this course. Interactive labs will give participants the opportunity to apply what they have learnt, and some of the lab material can be completed after the course to consolidate it further. We will use the cross-platform, open-source software R; in particular, we will make use of the leaps, bestglm, glmnet and mplot packages.

For exhaustive model search, we will show how to use the leaps and bestglm packages. When exhaustive search is not possible, we will show how to use penalised regression methods as fast alternatives through the glmnet package. To assess stability in model selection, we will take advantage of both bootstrapping and cross-validation of regression models, and will show powerful ways to visualise the results using advanced graphics with the mplot package.
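
To make the two ideas above concrete, the following is a minimal, language-neutral sketch in Python using only the standard library. It is not the API of leaps, bestglm, glmnet or mplot, which the course itself uses in R. It assumes a small simulated data set with four candidate predictors, of which only the first two affect the response, and it uses BIC as the selection criterion; the function names (solve, rss, best_subset, inclusion_frequencies), the simulated coefficients and the number of bootstrap resamples are all illustrative choices. The sketch runs an exhaustive search over all subsets and then repeats that search on bootstrap resamples to estimate how often each predictor is selected.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(X, y, subset):
    """Residual sum of squares of least squares on an intercept plus the columns in subset."""
    Z = [[1.0] + [row[j] for j in subset] for row in X]
    k = len(subset) + 1
    xtx = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    beta = solve(xtx, xty)  # normal equations
    return sum((yi - sum(bj * zj for bj, zj in zip(beta, z))) ** 2
               for z, yi in zip(Z, y))


def best_subset(X, y):
    """Exhaustive search: return the tuple of column indices that minimises BIC."""
    n, p = len(y), len(X[0])
    best, best_bic = (), float("inf")
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            bic = n * math.log(rss(X, y, subset) / n) + (size + 1) * math.log(n)
            if bic < best_bic:
                best, best_bic = subset, bic
    return best


def inclusion_frequencies(X, y, n_boot, rng):
    """Share of bootstrap resamples in which each column is selected."""
    n, p = len(y), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        chosen = best_subset([X[i] for i in idx], [y[i] for i in idx])
        for j in chosen:
            counts[j] += 1
    return [c / n_boot for c in counts]


# Simulated data (an assumption for illustration): only columns 0 and 1 matter.
rng = random.Random(1)
n, p = 100, 4
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 1) for row in X]

selected = best_subset(X, y)
freqs = inclusion_frequencies(X, y, n_boot=50, rng=rng)
print("selected columns:", selected)
print("bootstrap inclusion frequencies:", freqs)
```

Predictors with inclusion frequencies close to one are selected stably, while frequencies well below one flag components whose selection depends on the particular sample. With p candidate predictors the search visits 2^p models, which is why stepwise or penalised methods become necessary as the dimension grows.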

Statistical model building is a fundamental part of many statistical analyses and will be of potential interest to all attendees at the IBC. The aim of all analyses is to use the data and, if available, information about its generating process, to construct statistical models which parsimoniously describe relevant and important features in the data. Too often in applied statistics, model selection is based on outdated methods, for example stepwise techniques. This workshop will highlight the limitations of established model selection methods and showcase more recent approaches, with a focus on selecting a stable model.