Autumn 2024 Projects


Antonio Olivas: Function Estimation using Reproducing Kernel Hilbert Spaces

Prerequisites: Real analysis at the level of Math 424

Reproducing kernel Hilbert spaces (RKHSs) are a particular class of Hilbert spaces, defined by reproducing kernels, that enjoy a geometric structure similar to that of ordinary Euclidean space and, depending on the kernel, may contain a reasonably broad class of functions. RKHSs are widely used to estimate functions by optimizing over function spaces, a task that appears in many statistical problems such as interpolation, regression, and density estimation. They are attractive because many optimization problems over these spaces reduce to relatively simple calculations involving the kernel matrix.
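
As a small taste of that last point, here is a minimal kernel ridge regression sketch in Python: the optimization over the (typically infinite-dimensional) RKHS collapses, via the representer theorem, to an n-by-n linear solve with the kernel matrix. The kernel, bandwidth, and penalty below are arbitrary illustrative choices, not taken from the textbook.

```python
# Minimal kernel ridge regression: the RKHS optimization reduces to a
# linear solve with the n x n kernel (Gram) matrix.
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.2):
    # Gaussian (RBF) kernel matrix between 1-D point sets a and b.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * bandwidth**2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

lam = 1e-2                                            # ridge penalty
K = gaussian_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)  # representer theorem

x_new = np.linspace(0, 1, 5)
f_hat = gaussian_kernel(x_new, x) @ alpha             # fitted function at new points
print(np.round(f_hat, 3))
```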

In this project we will read most of Chapter 12 of the textbook “High-Dimensional Statistics: A Non-Asymptotic Viewpoint” by Martin J. Wainwright, which covers RKHSs. The goal of this DRP is to understand the properties of RKHSs and to apply this tool to a real-life problem.



Ethan Ancell: Robust Statistics

Prerequisites: Probability/statistical inference at the level of Stat 394/395 is required, mathematical analysis at the level of Math 424 is highly recommended

Most likely, you have seen a lot of results in statistics that follow a recipe of “suppose the data follows XYZ distribution, or obeys this model ZYX. Then, some result holds.” However, you might ask what happens if the model or assumed distribution is wrong! The area of robust statistics studies what happens to our usual results when models or distributions are misspecified, and how to create estimators that enjoy nice theoretical properties while not being too sensitive to small deviations in model specification. This DRP will follow selected topics from the textbook “Robust Statistics” by Peter Huber and Elvezio Ronchetti, and will primarily study these topics from a more mathematical point of view.
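
As a toy illustration of the basic phenomenon (invented numbers, not from the textbook): a single gross outlier can move the sample mean arbitrarily far, while the median barely budges.

```python
# One corrupted observation ruins the mean but not the median.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0, 1, size=100)
contaminated = np.append(clean, 1000.0)  # a single gross outlier

print(f"mean:   {clean.mean():6.2f} -> {contaminated.mean():6.2f}")
print(f"median: {np.median(clean):6.2f} -> {np.median(contaminated):6.2f}")
```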



Katherine Delno: An Introduction to Statistical Learning

Prerequisites: Basic probability and statistics (STAT 311 or equivalent), familiarity with regression, and some experience with R.

This reading project provides an introduction to foundational concepts in statistical learning. We will explore key topics such as linear and logistic regression, classification methods, model selection, resampling methods, and regularization techniques. The focus will be on introducing these concepts and providing a broad theoretical overview through reading assignments from the open-access textbook “An Introduction to Statistical Learning with Applications in R, 2nd Edition”. The aim is a broad yet accessible overview of statistical learning; the project is not intended for undergraduates who have already taken coursework in this area.
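
As a quick taste of one topic on the list, regularization: ridge regression shrinks coefficients toward zero as the penalty grows. The book's labs use R; this minimal sketch is in Python purely for illustration, with made-up data.

```python
# Ridge regression by hand: (X'X + lam*I)^{-1} X'y for increasing penalties.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])   # true coefficients
y = X @ beta + rng.normal(size=50)

for lam in [0.0, 1.0, 100.0]:
    b = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    print(f"lambda={lam:6.1f}  coefficients: {np.round(b, 2)}")
```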



Kayla Irish: Introduction to survival analysis in clinical trials

Prerequisites: Having taken MATH/STAT 395 is highly recommended, and MATH/STAT 394 or STAT 340 is required. Some experience in R is recommended but not required.

Clinical trials often collect and assess survival (or time-to-event) data from patients. Survival data does not follow the typical distributions of many other types of data: it is non-negative and often skewed, depending on the rate at which events occur. It is also typically subject to censoring (incomplete or missing data), which can occur for a variety of reasons. Survival analysis is an area of statistics designed for modeling this type of data.

Our goal will be to read 1-2 sources to learn the basics of survival analysis, and then to read about 1-2 clinical trials that used survival analysis methods to determine whether a treatment was effective. If time permits, we will also apply a survival analysis method, such as the Cox proportional hazards model or the Kaplan-Meier estimator, to a dataset or simulation.
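
To give a flavor, here is a by-hand Kaplan-Meier sketch on invented right-censored data; it only shows the estimator's product form, and a real analysis would use an established package (e.g., the survival package in R).

```python
# Kaplan-Meier by hand: S(t) = product over event times of (1 - d_i / n_i).
import numpy as np

times  = np.array([2, 3, 3, 5, 6, 7, 9, 11])  # follow-up times (invented)
events = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # 1 = event observed, 0 = censored

surv = 1.0
for t in np.unique(times[events == 1]):
    at_risk = np.sum(times >= t)                 # n_i: subjects still at risk at t
    died = np.sum((times == t) & (events == 1))  # d_i: events observed at t
    surv *= 1 - died / at_risk
    print(f"t = {t:2d}:  S(t) = {surv:.3f}")
```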



Nila Cibu: Introduction to Measure Theory

Prerequisites: Highly recommended: linear algebra (Math 208 or Math 340) and mathematical analysis (Math 327). Would be cool to have: some probability theory (Stat 394/395 or Stat 341/342).

This will be an introduction to measure theory, one of the foundations of modern probability theory, providing a stronger mathematical foundation for some of the basic statistical theory introduced in classes such as Stat 341 and 394. The sessions will mainly focus on the analytical aspects of measure theory and, if there is time, on how they are used in probability theory. There might not be enough time to cover each topic in depth, so our priority will most likely be a general overview of measure theory.
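
As a taste of the starting point, here is the central definition the reading builds on, stated in standard textbook form:

```latex
% A measure space (X, \mathcal{F}, \mu): \mathcal{F} is a \sigma-algebra on X
% (it contains X and is closed under complements and countable unions), and
% \mu is a measure, i.e. a countably additive set function:
\mu : \mathcal{F} \to [0, \infty], \qquad \mu(\emptyset) = 0, \qquad
\mu\Big( \bigcup_{n=1}^{\infty} A_n \Big) = \sum_{n=1}^{\infty} \mu(A_n)
\quad \text{for pairwise disjoint } A_1, A_2, \ldots \in \mathcal{F}.
```

A probability measure is simply a measure with total mass one, which is the bridge to the probability theory seen in Stat 341 and 394.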



Olivia McGough: Reading Introduction to Statistical Learning

Prerequisites: The book is written to be accessible to a broad audience, so I think the only prerequisite is some statistics/probability knowledge and an interest in learning more. Familiarity with R/Python would allow us to go through some of the labs, but this is not necessary.

I’ve been wanting to read Introduction to Statistical Learning, so let’s read it together! A lot of terms get thrown around when we talk about statistical learning – (un)supervised learning, deep learning, neural networks, SVMs, etc. – what are these things, how do they relate to each other, and what is the relation to statistics? I imagine we will pick a subset of the chapters to go through, and which/how many chapters we cover will depend on the student and their background/interests.



Ronan Perry: Causal Inference for Social Network Data

Prerequisites: Coursework in probability, statistics, and regression. Coursework related to causal inference is helpful but not necessary.

Most introductory statistics material focuses on settings with independent observations. However, it is often the case that observations are not independent, as with social network data and spatial data. We will read, discuss, and implement ideas from the paper “Causal Inference for Social Network Data” by Ogburn et al. We will begin with an overview of regression and causal inference, and then discuss problems and solutions when the data is dependent.



Shirley Mathur and Leon Tran: Modern Approaches to Post-Prediction Inference

Prerequisites: Familiarity with the concept of limits, probability concepts at level of STAT 342, and linear regression at level of STAT 423.

In many fields, such as the physical and life sciences, it is costly and time-consuming to collect “gold-standard” empirical data with which to analyze and learn about physical and biological systems. However, with recent advancements in machine learning (ML), we now have models, such as AlphaFold, that can use previously collected data to predict the folding patterns of proteins from their amino acid sequences, even for proteins never before observed in the lab, and thus generate new data on these protein structures. But as data of this nature is itself generated from a model, it is natural to wonder whether biases in that model might lead to bias and invalid inference for parameters estimated from such data. In this project, we will closely read the recent paper “Assumption-lean and Data-adaptive Post-Prediction Inference” (https://arxiv.org/pdf/2311.14220) to learn about a modern technique that allows us to conduct valid statistical inference on parameters estimated from ML-generated data. The focus of the project will be to learn the methodology proposed in the paper and apply it to analyze either an existing ML-generated dataset chosen by the students or a dataset that the students generate from an ML model. If time permits, we will also delve further into the theory underlying the methods introduced in the paper.
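
To preview the core issue, here is a toy simulation in the spirit of post-prediction inference; it is not the estimator from the paper, and all quantities are invented. A naive analysis of ML-predicted outcomes inherits the model's bias, while a small gold-standard sample can be used to correct it.

```python
# Naive inference on ML-generated outcomes vs. a simple bias correction
# estimated from a small labeled (gold-standard) sample.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    y = rng.normal(2.0, 1.0, size=n)             # gold-standard outcome
    yhat = y + 0.3 + rng.normal(0, 0.5, size=n)  # ML prediction with +0.3 bias
    return y, yhat

y_lab, yhat_lab = simulate(500)        # small labeled sample: both observed
_, yhat_unlab = simulate(100_000)      # large sample: predictions only

naive = yhat_unlab.mean()              # inherits the model's bias
rectifier = (y_lab - yhat_lab).mean()  # bias estimated on labeled data
corrected = naive + rectifier

print(f"naive {naive:.3f}   corrected {corrected:.3f}   truth 2.000")
```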

Note: this DRP is co-mentored by two graduate students, and will be accepting two undergraduate students into the project.



Skylar Shi: Introduction to Bayesian statistics

Prerequisites: Probability theory; familiarity with some basic distributions such as the Normal, Gamma, Poisson, and Binomial.

In traditional frequentist statistics, a parameter to be estimated is viewed as a fixed number. What can we do if we want to explore the uncertainty of this parameter? In Bayesian statistics, the parameter is viewed as a random variable, meaning that it follows a distribution and can be given credible intervals. In addition, Bayesian methods allow us to assign a distribution to a parameter based on our experience a priori, and then make adjustments based on observed data to get the posterior distribution. The impact of the data depends on the sample size.

In this DRP project, we will learn some basic theory of Bayesian statistics: for example, how to select prior distributions and how to derive posterior distributions. We will also have a chance to explore the relationship between maximum likelihood estimators (MLEs) and posterior means.
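
A standard conjugate example of the kind we might work through is the Beta-Binomial model: with a Beta(a, b) prior on the success probability and x successes in n trials, the posterior is again a Beta distribution, and the posterior mean is a weighted average of the MLE and the prior mean.

```latex
% Beta prior + Binomial likelihood => Beta posterior (conjugacy):
\theta \sim \mathrm{Beta}(a, b), \quad x \mid \theta \sim \mathrm{Binomial}(n, \theta)
\;\Longrightarrow\;
\theta \mid x \sim \mathrm{Beta}(a + x,\; b + n - x),
% and the posterior mean interpolates between the MLE and the prior mean:
\mathbb{E}[\theta \mid x]
= \frac{a + x}{a + b + n}
= \frac{n}{a + b + n} \cdot \underbrace{\frac{x}{n}}_{\text{MLE}}
+ \frac{a + b}{a + b + n} \cdot \underbrace{\frac{a}{a + b}}_{\text{prior mean}}.
```

As n grows, the weight on the MLE tends to one, which is exactly the sense in which the impact of the data depends on the sample size.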



Wolfgang Brightenburg: Introduction to Gaussian Processes

Prerequisites: linear algebra, recommended comfort with probability to the level of STAT 394, basic understanding of coding in R

In this project, we’ll explore Gaussian processes for machine learning, a beautiful modeling framework which connects to a variety of rich mathematical, computational, and statistical subjects. In brief, Gaussian processes are powerful general function approximators, offering competitive predictive performance while yielding simple uncertainty quantification on their predictions. Using “Gaussian Processes for Machine Learning” by C. E. Rasmussen & C. K. I. Williams as our guide (made freely available online by the authors, just give it a quick google!), we’ll start from the fundamentals of Bayesian linear regression, building our intuition and working our way towards deriving Gaussian process models as Bayesian nonparametric regression, in which inference is done over a general function space, a more general paradigm than tuning regression coefficients directly. Throughout our reading, we’ll code up our own models along the way and use them on applications of personal interest. Following our base readings, we can continue the DRP in a variety of directions based on your interests: some potential ideas include learning about the close connections of GPs and neural networks through the Neural Tangent Kernel, retooling GPs for classification, and multitask learning with GPs, but there are many others we can explore!
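
To preview where the reading lands, here is a minimal sketch of the standard GP regression equations (posterior mean and variance via a Cholesky solve, as in the book's Chapter 2); the kernel and its hyperparameters are fixed by hand here rather than learned.

```python
# Gaussian process regression with a squared-exponential kernel.
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=20)             # training inputs
y = np.sin(X) + 0.1 * rng.normal(size=20)   # noisy observations
Xs = np.linspace(-3, 3, 200)                # test inputs

K = rbf(X, X) + 0.1**2 * np.eye(len(X))     # add observation noise
L = np.linalg.cholesky(K)                   # numerically stable solve
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

mean = rbf(Xs, X) @ alpha                           # posterior mean
v = np.linalg.solve(L, rbf(X, Xs))
var = np.diag(rbf(Xs, Xs)) - np.sum(v**2, axis=0)   # posterior variance
print(np.round(mean[:3], 3), np.round(var[:3], 3))
```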



Yash Talwekar: System Identification Techniques

Prerequisites:

Mathematical descriptions of real-world systems, whether biological or engineered, play a crucial role in understanding them, especially if some level of control over the system is desired. Most systems can be described as grey-box, where we can formulate a structure using our knowledge of the system, or as black-box, where we have no information on that structure. Using system identification techniques, we can generate a model based on measured data, obtained by exciting the system with prescribed inputs and recording the outputs.

In this project, we will study some system identification techniques to deduce a linear mathematical structure of a mechanical system. The time-domain input and output signals will be generated in a simulation, and we will explore how black-box models such as ARX and ARMAX fit the data and also investigate how a grey-box model performs by using our insights on the system’s mathematical structure. Core concepts that will be explored include least-squares regression and auto-regression. Some programming will be required.
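
As a preview of the least-squares core, here is a minimal sketch of fitting a first-order ARX model to simulated data; the "true" system and noise level below are invented for illustration.

```python
# Fit y[t] = a*y[t-1] + b*u[t-1] + e[t] by ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
T = 500
u = rng.normal(size=T)                  # exciting input signal
y = np.zeros(T)
for t in range(1, T):                   # simulate the "true" system
    y[t] = 0.8 * y[t - 1] + 0.5 * u[t - 1] + 0.05 * rng.normal()

Phi = np.column_stack([y[:-1], u[:-1]])              # lagged regressors
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)  # least-squares fit
print(f"estimated a = {theta[0]:.3f}, b = {theta[1]:.3f}  (truth: 0.8, 0.5)")
```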



Yeting Wu: Analysis of Variance in Facebook Friending Behavior

Prerequisites: STAT 311, STAT 220, STAT 302, and STAT 316 are suggested.

This project investigates the factors influencing Facebook friending behavior using Analysis of Variance (ANOVA). We’ll explore how a profile owner’s gender, the evaluator’s gender, and the attractiveness of a profile picture affect the willingness to “friend” someone. Students will learn to fit ANOVA models, interpret results, and understand interactions between variables in a real-world social media context. This project is ideal for students with a basic understanding of statistics who are interested in applying their knowledge to contemporary social issues.
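
To show the shape of the analysis, here is a minimal two-way ANOVA sketch on simulated data; the factor names and effect sizes are invented stand-ins for the project's variables.

```python
# Two-way ANOVA with an interaction term via statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "owner_gender": rng.choice(["F", "M"], size=n),
    "attractiveness": rng.choice(["high", "low"], size=n),
})
# Simulated willingness-to-friend score with a main effect of attractiveness.
df["willingness"] = 3.0 + 0.8 * (df["attractiveness"] == "high") \
    + rng.normal(0, 1.0, size=n)

model = ols("willingness ~ C(owner_gender) * C(attractiveness)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects and interaction
```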



Yuhan Qian: Introduction to Gaussian Processes

Prerequisites: Probability at the level of STAT 394 recommended but not necessary; some experience in R or Python.

Everyone knows about the Gaussian distribution. Its infinite-dimensional generalization, the Gaussian process (GP), is a commonly used tool in supervised machine learning, widely applied in regression and classification tasks. In this project, we will begin by exploring fundamental mathematical concepts and the standard GP model. We will also apply GPs to solve some interesting problems.
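
One way to see the "distribution over functions" idea is to draw random functions from a GP prior; here is a minimal sketch, with the kernel and lengthscale chosen arbitrarily.

```python
# Sample functions from a zero-mean GP prior with a squared-exponential kernel.
import numpy as np

def rbf(x, lengthscale=1.0):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale**2)

x = np.linspace(-3, 3, 100)                  # a grid of input points
K = rbf(x) + 1e-8 * np.eye(len(x))           # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)                         # 3 random functions on the grid
```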