Interview with Maria Kateri on the role of statistics in data science
Prof. Dr. Maria Kateri is a professor of Statistics and Data Science and leads one of the groups involved in the RWTH Center for Artificial Intelligence. She is an expert in statistics and data science, with a focus on categorical data analysis, reliability, multivariate statistical modelling and inference. Together with Prof. Alan Agresti from the University of Florida, she recently authored the book “Foundations of Statistics for Data Scientists: With R and Python”. We therefore interviewed her about the new book and the relevance of her work in the context of the RWTH Center for Artificial Intelligence.
You are a professor of Statistics and Data Science at RWTH Aachen University and contribute to the mathematical foundations of the RWTH Center for Artificial Intelligence. Can you tell us more about your research focus?
My main research areas are stochastic modeling and statistical inference, with applications in, e.g., reliability, engineering and the natural sciences, and the behavioral and social sciences. I thus work on the development of statistical methods for estimation and testing, model building and model validation, simulations, and the adjustment of procedures for implementation in real-data settings. In particular, my focus is on categorical and ordinal data analysis, high-dimensional statistics and variable selection, statistical information theory, the design and analysis of accelerated life testing (ALT) experiments, and the analysis of censored data with applications in engineering. For example, my group is currently working on two interdisciplinary research projects: one on accelerated lifetime estimation and prediction for lithium-ion batteries (with the team of Dirk Uwe Sauer at RWTH and partners from other universities and research institutions, funded by BMBF), and another on Bayesian statistical modeling of Raman spectroscopy data (with Hans-Jürgen Koß and Leo Bahr, under interdisciplinary seed funding within the call “AI, Simulation, and Data Science: Data-driven Methods for Real-world Challenges”). In my research I adopt both the frequentist and the Bayesian approach.
Some would argue that Automated Machine Learning (AutoML) can be applied to a range of data science problems and users no longer need to know how it really works. What would be your response?
It is the classical dispute of stochastic vs. algorithmic modeling, the two cultures of modeling. AutoML focuses on developing prediction tools and algorithms, while stochastic models aim at understanding the underlying data mechanism and using this mechanism for predictions, thus ensuring reproducibility of the results of data analysis and controlling uncertainty. Both essentially pursue the same goal but approach it from different perspectives. I strongly believe that statistical thinking and the principles of statistical inference are also important for algorithmic approaches. Furthermore, some ML methods are identical to statistical methods, and at certain points the distinction between ML and statistics is not that clear. I would juxtapose “statistical learning” with “machine learning”. In my view, people from statistics and data science should combine their strengths and utilize synergies. Statistics develops foundations and theory that provide principles for data analysis, tackling new challenges in analyzing complex data and having an impact that is not limited to a single application. Statistics can provide data science with frameworks and tools to build solid foundations. Issues such as the accuracy and validity of data analyses in data science, reproducibility and causality, and uncertainty quantification in prediction are very important and can be tackled with the help of statistics.
Your new book “Foundations of Statistics for Data Scientists: With R and Python”, published by Chapman and Hall/CRC (see below for the full reference), is aimed at students training to become data scientists. What are the unique features of this book?
The book is designed as a textbook for an introduction to mathematical statistics for students training to become data scientists. It provides an in-depth presentation of the topics in statistical science with which any data scientist should be familiar, including probability distributions, descriptive and inferential statistical methods, and linear modeling. However, it is also suitable for statistics courses for students in other disciplines, and it is useful for researchers from other disciplines and for practitioners. Compared to traditional textbooks of mathematical statistics, the book places less emphasis on probability theory and more emphasis on using software to implement statistical methods. The book’s website (see below) offers a rich collection of data sets, all of them discussed in examples and exercises. A unique feature of this book is that it combines the aims and characteristics of classical books on statistics with those of applied books that focus on data analysis with software. In designing the book, our first priority was for readers to gain a deep understanding of key statistical concepts and methods and simultaneously learn how to apply all the methods and models discussed in practice. To convey understanding, we conduct simulations to illustrate key concepts. All statistical analyses in the book use R software, the presentation of which is integrated smoothly into the presentation of the methods in the chapters. An appendix shows the same analyses using Python. Hence, all methods and models discussed in the book are implemented in both R and Python. The book also introduces modern topics that do not usually appear in introductory mathematical statistics books but are highly relevant for data scientists, such as Bayesian inference, generalized linear models for non-normal responses (e.g., logistic regression and Poisson loglinear models), and regularized model fitting for high-dimensional problems.
Since practice is very important for understanding and learning statistics, the book contains nearly 500 exercises, both theoretical and applied. Furthermore, the book's website provides, in addition to all data sets from the examples and exercises, expanded R, Python, and Matlab appendices.
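To give a flavor of how a simulation can illustrate a key concept, here is a minimal sketch in Python. It is not taken from the book; the function name ci_coverage, the exponential population, the sample size and the seed are all assumptions chosen for illustration. It repeatedly draws samples, builds a large-sample 95% confidence interval for the mean each time, and counts how often the interval contains the true mean.

```python
import random
import statistics


def ci_coverage(n=50, reps=2000, true_mean=1.0, level=0.95, seed=1):
    """Estimate how often a large-sample CI for the mean covers the true mean.

    Hypothetical illustration: samples come from an exponential distribution
    with the given mean, so the population is skewed rather than normal.
    """
    rng = random.Random(seed)
    # Standard normal quantile for a two-sided interval at the given level.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5  # estimated standard error
        if mean - z * se <= true_mean <= mean + z * se:
            hits += 1
    return hits / reps


coverage = ci_coverage()
print(f"Estimated coverage of the nominal 95% interval: {coverage:.3f}")
```

Rerunning the sketch with a different n shows how the actual coverage of the interval relates to its nominal level as the sample size changes.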
The subtitle “With R and Python” suggests that tooling is important for you. Can you tell us more about it? What is your favorite language: R or Python? Why?
You are right, tooling is extremely important, though without diminishing the value of theory. At the end of the day, we want to guide students to be able to apply statistical methods efficiently in practice, understand the outcome and report it adequately. For this, software skills are undoubtedly required but are not sufficient. A deep understanding of the underlying theory is crucial. Hence, a course (and a book) on statistics for data scientists needs to strike a balance between theory and software applications. For me it is an interplay: efficient tooling presupposes a solid theoretical background, while software applications can be employed to stimulate understanding of methods and procedures.
In my research, theory and implementation are generally combined. I concentrate on developing stochastic models and procedures for estimation and fitting and on investigating their properties, but at the same time it is important to me that these developments are of practical use. For this reason, implementing the associated algorithms and making them available in statistical software is part of the work.
In teaching and in research, I use R, which is the standard language in the broad statistical community. Beyond the basic R packages for statistical computing and graphics, many more are available through CRAN (the Comprehensive R Archive Network), covering a very wide range of up-to-date statistical methods for data analysis and visualization. Python is often preferred in the data science community and has more straightforward procedures for, e.g., data wrangling and graphs. However, both languages are developing rapidly; e.g., the tidyverse collection of R packages is designed for data science, while Python is strengthening its pool of statistical methods. I would say that it is good for a data scientist to be familiar with both languages.
The uptake of Data Science and Machine Learning has changed the importance of different subfields of statistics. I guess that Bayesian inference and generalized linear models (e.g., logistic regression) are very important for any data scientist. What are other topics that have become more important, and which subfields have become less important?
As you already mentioned, Bayesian approaches and the Bayesian way of thinking are natural frameworks for Machine Learning, since we do not learn from a specific training data set only, but also integrate prior information and possibly experts’ opinions. In terms of models and statistical methods that are of great importance for data scientists, beyond the generalized linear models (GLMs), which provide flexibility and a unified setup for modeling non-normally distributed data, including categorical data, I would also mention graphical models, generalized additive models (GAMs), methods and models for the analysis of time-dependent data, functional data analysis, methods for high-dimensional data (e.g., lasso), and extreme value theory, among others. Furthermore, computational statistics is a field of increasing interest (consider, e.g., Markov Chain Monte Carlo (MCMC) algorithms, bootstrap methods, EM algorithms). Topics that are less important for Data Science and Machine Learning are, for example, methods for exact inference and small-sample inference. However, they are important in other fields (e.g., in biostatistics and psychometrics).
The book focuses on things a data scientist should know. However, we are, of course, also interested in your current research. What paper are you currently working on?
Presently, among the papers I am working on, the one most directly related to data science deals with high-dimensional Bayesian variable selection via an adaptive MCMC method called the metropolized adaptive subspace algorithm, joint work with Christian Staerk and Ioannis Ntzoufras. It is based on an independent Metropolis-Hastings sampler, in which the sampling probabilities of the explanatory variables are sequentially adapted after each iteration using a form of Bayesian learning. We prove that these probabilities eventually converge to the respective posterior marginal inclusion probabilities.
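The general idea of an independence Metropolis-Hastings sampler with adapted inclusion probabilities can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' algorithm: the target is a made-up unnormalized posterior over six inclusion indicators (WEIGHTS and the interaction term are hypothetical), and the running-average update of the proposal probabilities, with the hypothetical tuning values prior_weight and eps, is a simple choice made here because the interview does not give the actual update rule. With only six variables, the exact posterior inclusion probabilities can be computed by enumeration and compared with the adapted proposal probabilities.

```python
import itertools
import math
import random

# Toy unnormalized log posterior over inclusion vectors gamma in {0,1}^6.
# Hypothetical values for illustration only; not a real regression posterior.
WEIGHTS = [2.0, 1.5, -1.0, -2.0, 0.0, -1.5]


def log_post(gamma):
    main = sum(w * g for w, g in zip(WEIGHTS, gamma))
    # Variables 0 and 1 partly substitute for each other (mild dependence).
    return main - 1.0 * gamma[0] * gamma[1]


def log_proposal(gamma, r):
    """Log probability of gamma under independent Bernoulli(r_j) proposals."""
    return sum(math.log(rj if g else 1.0 - rj) for g, rj in zip(gamma, r))


def adaptive_independence_sampler(n_iter=20000, prior_weight=10.0,
                                  eps=0.02, seed=0):
    rng = random.Random(seed)
    p = len(WEIGHTS)
    r0 = [0.5] * p              # initial proposal probabilities
    r = r0[:]
    gamma = (0,) * p            # start from the empty model
    counts = [0] * p            # how often each variable was in the chain
    for t in range(1, n_iter + 1):
        # Propose a whole new model, independently of the current one.
        cand = tuple(int(rng.random() < rj) for rj in r)
        log_ratio = (log_post(cand) - log_post(gamma)
                     + log_proposal(gamma, r) - log_proposal(cand, r))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            gamma = cand
        for j in range(p):
            counts[j] += gamma[j]
        # Assumed adaptation: running average of past inclusions, shrunk
        # toward r0 and clipped to [eps, 1 - eps] so every model stays reachable.
        r = [min(1.0 - eps, max(eps, (prior_weight * r0[j] + counts[j])
                                / (prior_weight + t)))
             for j in range(p)]
    return r


def exact_inclusion_probs():
    """Posterior marginal inclusion probabilities by full enumeration."""
    p = len(WEIGHTS)
    models = list(itertools.product((0, 1), repeat=p))
    mass = [math.exp(log_post(m)) for m in models]
    total = sum(mass)
    return [sum(w for m, w in zip(models, mass) if m[j]) / total
            for j in range(p)]


adapted = adaptive_independence_sampler()
exact = exact_inclusion_probs()
for j, (a, e) in enumerate(zip(adapted, exact)):
    print(f"variable {j}: adapted proposal {a:.3f}, exact posterior {e:.3f}")
```

In this toy setting the adapted proposal probabilities can be checked directly against the enumerated values, which is of course not possible in the high-dimensional problems the method is designed for.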
Great. Thank you for the responses and contributions to the RWTH Center for Artificial Intelligence!
Agresti, A. & Kateri, M. (2022). Foundations of Statistics for Data Scientists: With R and Python. Chapman and Hall/CRC.
Book's website: http://stat4ds.rwth-aachen.de/