SC.01 - Targeted learning: bridging machine learning and causality

Sunday, 5 July 2020  |  9:00 - 17:00 | Location: TBD

Instructors:
  • Antoine Chambaz
  • David Benkeser

Summary
Enthusiasm surrounds the application of machine learning (ML) in many disciplines. This excitement must be tempered by the recognition that current ML algorithms are designed to learn intricate dependencies, not to draw valid causal or statistical inferences. Integrating ML algorithms with formal statistical frameworks for causation and inference makes it possible to leverage the advances in ML while safeguarding against spurious conclusions.

Coined in 2006, targeted learning (TL) is a general approach to learning from data that reconciles ML and inference. TL exploits ML to estimate infinite-dimensional features of the data-generating law, such as regression functions. ML algorithms are highly versatile and produce (possibly highly) data-adaptive estimators, which pose challenges for drawing valid inference. TL provides a framework for de-biasing ML estimates in order to obtain robust and efficient procedures, and to build confidence regions and hypothesis tests.

TL has been gaining traction steadily for more than a decade. It has been applied and studied in a great variety of contexts [van der Laan & Rose, Springer, 2011, 2018]. Its analysis is framed in the theory of inference based on semiparametric models. We will illustrate how TL unfolds in an example from causal analysis, using an R package designed specifically for this purpose.

Prerequisites
This course is best suited for individuals with graduate-level statistical training, for example, consistent with the level of Casella & Berger's Statistical Inference [Duxbury Press, 2008]. Many concepts will be illustrated both mathematically and computationally using R. Thus, some understanding of basic R for data analysis is useful, for example, computational training consistent with Wickham & Grolemund's R for Data Science [O'Reilly, 2017].

Outline
A Modeling
A1 Rudiments of causal analysis: Motivating examples from the analysis of vaccine efficacy trials and HIV prevention trials; causal analysis based on directed acyclic graphs.
A2 The parameter of interest: Introducing the parameter of interest, motivated by a causal quantity (the causal effect of a two-level treatment). Viewing the parameter of interest as the value of a functional defined on the statistical model, evaluated at the law of the data.
A3 Regularity: Discussing the regularity/smoothness of the above functional, based on the notion of fluctuation (that is, a finite-dimensional submodel through a given law).
A4 Double-robustness: Discussing a remarkable property that the functional enjoys and that we fully exploit. Defining the so-called remainder term. 
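The double-robustness property discussed in A4 can be previewed numerically. The sketch below is written in Python purely for illustration (the course itself uses R); the data-generating law and variable names are invented for the example. It simulates a confounded binary treatment, builds a deliberately misspecified outcome regression, and shows how a doubly-robust augmented estimator that also uses a correctly specified treatment mechanism repairs the resulting bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Invented confounded data-generating law: binary confounder W,
# treatment A, outcome Y; the true average treatment effect is 0.3
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * W)            # treatment probability depends on W
Y = rng.binomial(1, 0.2 + 0.3 * A + 0.2 * W)  # outcome depends on A and W

# Deliberately MISSPECIFIED outcome regression: it ignores the confounder W
Q1_bad = np.full(n, Y[A == 1].mean())
Q0_bad = np.full(n, Y[A == 0].mean())

# Correctly specified (saturated) treatment mechanism g_hat(W) = P(A=1 | W)
g_hat = np.where(W == 1, A[W == 1].mean(), A[W == 0].mean())

# Plug-in estimate with the bad outcome model: biased by confounding
plugin_bad = np.mean(Q1_bad - Q0_bad)

# Doubly-robust correction: the inverse-probability-weighted residual term
# repairs the misspecified outcome model because g_hat is correct
resid = Y - np.where(A == 1, Q1_bad, Q0_bad)
dr = plugin_bad + np.mean((A / g_hat - (1 - A) / (1 - g_hat)) * resid)

print(round(plugin_bad, 3), round(dr, 3))  # plugin is off, dr is near 0.3
```

The symmetric case (correct outcome model, wrong treatment mechanism) also yields a consistent estimate; this is exactly the remarkable property that the course exploits.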
B Inference
B1 A simple inferential strategy: Introducing and discussing the Inverse-Probability of Treatment Weighting (IPTW) estimator.
B2 Nuisance parameters: An algorithmic stance on the estimation of the nuisance parameters that one must carry out to infer the parameter of interest. Why machine learning? How? Formalizing and applying machine learning through the SuperLearner and caret R packages.
B3 Two naive plug-in estimators: More on the IPTW estimator, and introducing and discussing the so-called G-computation (G-COMP) estimator.
B4 Standard analysis of plug-in estimators
B5 One-step correction: Introducing and discussing the one-step correction principle (Le Cam, 1956; Pfanzagl, 1982).
B6 Targeted minimum loss estimation: introducing and discussing the targeted minimum loss estimation methodology. 
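The estimators introduced in B1, B3 and B5 can be sketched in a few lines. The toy example below is written in Python rather than R, on a simulated triple (confounder W, treatment A, outcome Y) invented for illustration. Because W is binary, the nuisance parameters can be estimated with saturated empirical means, in which case the IPTW, G-computation and one-step estimates coincide exactly; their differences, and the need for correction, only appear once the nuisances are fit with flexible machine learning.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative data-generating law; the true treatment effect is 0.3
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * W)
Y = rng.binomial(1, 0.2 + 0.3 * A + 0.2 * W)

# Saturated nuisance estimates, possible here because W is binary:
# g_hat(w) = P_hat(A=1 | W=w) and Q_hat(a, w) = E_hat[Y | A=a, W=w]
g_hat = np.where(W == 1, A[W == 1].mean(), A[W == 0].mean())
Q = {(a, w): Y[(A == a) & (W == w)].mean() for a in (0, 1) for w in (0, 1)}
Q1 = np.where(W == 1, Q[(1, 1)], Q[(1, 0)])
Q0 = np.where(W == 1, Q[(0, 1)], Q[(0, 0)])

# B1: IPTW estimator (inverse probability of treatment weighting)
iptw = np.mean(A * Y / g_hat - (1 - A) * Y / (1 - g_hat))

# B3: G-computation plug-in estimator (average of predicted outcomes)
gcomp = np.mean(Q1 - Q0)

# B5: one-step correction = plug-in + empirical mean of the efficient
# influence function evaluated at the estimated nuisances
eif = (A / g_hat - (1 - A) / (1 - g_hat)) * (Y - np.where(A == 1, Q1, Q0))
one_step = gcomp + eif.mean()

print(iptw, gcomp, one_step)  # all three coincide for saturated nuisances
```

Targeted minimum loss estimation (B6) reaches the same efficient influence function through a different route: instead of adding the correction term, it fluctuates the initial outcome estimate so that the resulting plug-in solves the corresponding estimating equation.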

Learning Outcomes
• Better understand causal frameworks and their relation to classic statistical inference problems;
• Better understand how machine learning can (and should) be incorporated into data analysis;
• Better understand how semiparametric theory facilitates strategies for statistical inference;
• Be able to use the R packages SuperLearner and caret to perform basic machine learning;
• Be able to use the R package drtmle for basic causal inference combined with machine learning.

About the Instructors
Antoine Chambaz is a Professor of Statistics at Université Paris Descartes, France. His main research interests are in theoretical, computational and applied statistics. He is particularly interested in the theory of inference based on semiparametric models, targeted learning, and their application to precision medicine. Antoine Chambaz has been teaching statistics for twenty years, to first- to fifth-year students in France. Although he has never taught in English, he is an experienced speaker in English. Recently, he also gave a series of lectures on causality and statistics at:
 - Journées d'Étude en Statistique (JES), Fréjus (Oct. 15-19, 2019);
 - Statistics for Post Genomic Data, Barcelona (Jan. 31, 2018);
 - Mathématiques et médecine, École de l'Inserm Liliane Bettencourt (EdILB), Sèvres (Feb. 13, 2018).

David Benkeser is an Assistant Professor of Biostatistics and Bioinformatics at Emory University in Atlanta, GA, USA. His research interests include causal inference, vaccine efficacy trials, HIV prevention trials, semiparametric efficiency theory, and machine learning. His teaching experience includes: 
- Targeted Learning in Biomedical Big Data, University of California, Berkeley (Jan-May 2017)
- Statistical Inference, Emory University, Atlanta (Jan-May 2018)
- Modern Statistical Learning Methods for Biomedical Data, University of Washington Summer Institute in Clinical Research (2-day short course, July 2017, 2018)

Recommendations for the Course:
Textbooks:
1. Article by D. Benkeser & A. Chambaz (Available October 2019)
2. The monographs Targeted Learning and Targeted Learning in Data Science [van der Laan & Rose, Springer, 2011, 2018]
3. R for Data Science [Wickham & Grolemund, O'Reilly, 2017] is recommended to start using R. 

Laptop: The presenters recommend that members of the audience bring their laptops to run the R code themselves.