Mentors and Project Descriptions
Spring 2026 Projects
Andrew Zhang: Optimization and Sampling
Consider the problem of sampling from a distribution P, and suppose that we have a function f(Q) which measures the discrepancy between Q and the target P. For this notion of discrepancy to be meaningful, f should be minimized at P. The problem of sampling from P can therefore be thought of as trying to minimize f. In Euclidean space, the fundamental optimization algorithm is gradient descent, but it is far from clear how to develop gradient descent on the space of probability measures. Remarkably, there is a way to do gradient descent on f: this fact is known as “Langevin dynamics is the Wasserstein gradient flow of relative entropy.” We will try to understand what this means. Along the way, we will learn bits of optimization, stochastic calculus, optimal transport, and Riemannian geometry.
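To make the sampling-as-optimization idea concrete, here is a minimal sketch of the unadjusted Langevin algorithm, the simplest discretization of Langevin dynamics. The target (a standard Gaussian), step size, and iteration counts are all illustrative assumptions, not part of the project description.

```python
import numpy as np

# Unadjusted Langevin algorithm (ULA): discretized Langevin dynamics
# targeting P = N(0, 1), for which grad log p(x) = -x.
def grad_log_p(x):
    return -x

rng = np.random.default_rng(0)
x = 5.0        # arbitrary starting point (assumption)
step = 0.01    # step size (assumption)
samples = []
for t in range(20_000):
    # Gradient step on -log p plus Gaussian noise of scale sqrt(2*step).
    x = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.normal()
    samples.append(x)

samples = np.array(samples[5_000:])   # discard burn-in
print(samples.mean(), samples.std())  # both should be near (0, 1)
```

The noise term is what distinguishes this from plain gradient descent: without it the iterates would collapse to the mode, while with it the iterates approximately follow the Wasserstein gradient flow of relative entropy toward P.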
Bumjun Park: High-Dimensional Statistics
High-dimensional statistics focuses on the unique challenges that arise when the number of variables (p) in a dataset is comparable to, or even larger than, the number of observations (n). In such cases, traditional statistical tools like ordinary least squares (OLS) break down. High-dimensional statistics addresses this by introducing powerful regularization techniques, such as the Lasso and ridge regression, that can recover signals from vast amounts of data. This topic provides a bridge between classical theory and modern data analysis.
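As a small illustration of the p > n regime, the sketch below fits the Lasso by coordinate descent on simulated data with a sparse true signal. The dimensions, noise level, and penalty are assumptions chosen for the example; this is one standard way to compute the Lasso, not the only one.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                      # residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r + col_sq[j] * beta[j]
            # Soft-thresholding: this is what sets coefficients to zero.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (beta[j] - new)   # keep residual in sync
            beta[j] = new
    return beta

rng = np.random.default_rng(1)
n, p = 50, 200                        # more variables than observations
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 3.0                   # only 5 coefficients are nonzero
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = lasso_cd(X, y, lam=5.0)    # lam = 5.0 is an assumption
print(beta_hat[:5])                   # large, near the true value 3.0
```

Despite n < p, the L1 penalty recovers the sparse support, whereas OLS would simply interpolate the noise.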
Juejue Wang: Sensitivity Analysis for Difference-in-Differences Designs
Difference-in-differences (DiD) is one of the most widely used methods for drawing causal inferences from observational data in the empirical sciences. However, its key assumption, parallel trends, cannot be directly tested, which makes sensitivity analysis important for reliable conclusions.
In this DRP, we will read important papers on DiD, omitted variable bias, and debiased machine learning. The goal is to learn how to apply machine learning methods to perform sensitivity analysis for DiD. We will then apply (and potentially extend) existing code implementing these methods to a real dataset (e.g., the effect of Medicaid expansion on state spending on Medicaid, the effect of an increase in the minimum wage on teen employment).
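For intuition about the estimator being analyzed, here is the canonical two-group, two-period DiD on simulated data. The data-generating numbers (trend, baseline gap, effect size) are illustrative assumptions; the simulation builds in parallel trends, which is exactly the untestable assumption the project's sensitivity analysis probes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
treated = rng.integers(0, 2, n)   # group indicator (0/1)
tau = 2.0                         # true treatment effect (assumption)

# Outcomes share a common time trend (+1), the treated group has a
# baseline gap (+3), and tau applies only to treated units post-period.
y_pre = 3 * treated + rng.normal(size=n)
y_post = 3 * treated + 1 + tau * treated + rng.normal(size=n)

# Difference-in-differences: change for treated minus change for control.
did = (y_post[treated == 1].mean() - y_pre[treated == 1].mean()) \
    - (y_post[treated == 0].mean() - y_pre[treated == 0].mean())
print(did)  # close to tau = 2.0 under parallel trends
```

The subtraction removes both the time trend and the baseline group gap; when parallel trends fails, a bias term survives this subtraction, which is what sensitivity analysis bounds.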
Students who do particularly strong work on this project may be invited to continue with us on related research projects.
Kat Hoffman: Foundations of Causal Inference
Through guided readings and weekly discussions, participants will learn the conceptual foundations of causal reasoning, including counterfactual frameworks, confounding, identification, and common causal estimands. We will also explore core methodological approaches such as regression adjustment, matching, inverse probability weighting, and causal diagrams. The DRP is designed to build intuition about when causal conclusions are justified and how causal methods are applied in practice, with examples drawn from epidemiology, social science, and related fields. No prior knowledge of causal inference is required, though familiarity with basic statistics is recommended. Projects and readings can be scaled up or down depending on each student's interests and prior knowledge. Programming experience is not required, but can be incorporated as a learning tool if the student finds it useful.
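One of the methods listed above, inverse probability weighting, can be illustrated in a few lines. The simulation below (a single binary confounder with a known treatment probability, and all effect sizes) is an assumption-laden toy, meant only to show why reweighting removes confounding.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.integers(0, 2, n)                    # binary confounder
p_a = np.where(x == 1, 0.8, 0.2)             # treatment depends on x
a = rng.random(n) < p_a                      # treatment indicator
y = 1.0 * a + 2.0 * x + rng.normal(size=n)   # true effect of a is 1.0

# Naive difference in means is confounded: treated units have higher x.
naive = y[a].mean() - y[~a].mean()

# IPW: weight each unit by the inverse probability of the treatment
# it actually received, creating a pseudo-population where treatment
# is independent of x.
w = np.where(a, 1 / p_a, 1 / (1 - p_a))
ipw = (np.sum(w * a * y) / np.sum(w * a)
       - np.sum(w * (1 - a) * y) / np.sum(w * (1 - a)))
print(naive, ipw)  # naive is biased upward; ipw is close to 1.0
```

In practice the treatment probabilities are unknown and must themselves be estimated, which is where much of the methodology (and the reading) lives.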
Keunwoo Lim: Generation of Medical Time-Series Data
This project focuses on the implementation of a generative model for medical time-series data. We first review the recent literature on time-series generative models, medical time-series data analysis, and medical time-series data generation. Then, we implement the actual model and perform statistical analysis on the generated data. The choice of the generative model and time-series data might depend on the student’s background. One reference could be “Medical Time-Series Data Generation Using Generative Adversarial Networks.”
Nolan Cole: Shape Constrained Inference
Monotonic functions arise in many scientific settings. For example, the amount of antibodies in the bloodstream after vaccination is often hypothesized to be a non-decreasing function of vaccine dose, reflecting an expected dose–response relationship. Updating classical estimation and inference methods for monotonic functions to modern datasets is an active area of research. In this directed reading, the student will review assumption-lean algorithms for estimating monotonic functions and, given time, explore approaches to statistical inference under monotonicity constraints.
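The classical least-squares estimator under a monotonicity constraint is isotonic regression, computed by the pool adjacent violators algorithm (PAVA). The short implementation below is a sketch; the toy data are assumptions chosen so the merging behavior is visible.

```python
import numpy as np

def pava(y):
    """Non-decreasing fit minimizing sum of squared errors to y."""
    # Maintain blocks of (sum, count); whenever a new block's mean
    # drops below the previous block's mean, pool them.
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)   # each pooled block gets its mean
    return np.array(fit)

y = np.array([1.0, 3.0, 2.0, 4.0, 3.5, 5.0])
print(pava(y))  # [1.0, 2.5, 2.5, 3.75, 3.75, 5.0]
```

Note that PAVA requires no tuning parameters and no smoothness assumptions, which is the sense in which shape-constrained methods are "assumption-lean."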
Rui Wang: Introduction to Survival Analysis
This project will introduce the fundamental concepts of survival analysis. Potential topics include probability distributions for time-to-event data, the Kaplan–Meier estimator, the log-rank test, and Cox regression models. Depending on the mentee’s technical background and interests, the project may be oriented toward either theoretical development or applied data analysis.
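As a taste of one listed topic, here is the Kaplan–Meier product-limit estimator of the survival function on a tiny hand-made dataset; the event times and censoring pattern are assumptions for illustration.

```python
import numpy as np

def kaplan_meier(times, event):
    """Kaplan-Meier estimate of S(t); event=0 marks censored times."""
    order = np.argsort(times)
    times, event = times[order], event[order]
    surv, s = {}, 1.0
    for t in np.unique(times):
        d = int(event[times == t].sum())   # events (deaths) at time t
        n = int((times >= t).sum())        # number still at risk at t
        if d > 0:
            s *= 1 - d / n                 # product-limit update
        surv[t] = s
    return surv

times = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 8.0, 9.0])
event = np.array([1,   1,   0,   1,   1,   0,   1])
print(kaplan_meier(times, event))
```

Censored observations (event = 0) still count in the risk set until their censoring time but never trigger a drop in the curve, which is how the estimator uses partial information from incomplete follow-up.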
Yuhan Qian: From Small Language Models to AI Agents
Large language models (LLMs) have enabled new AI systems that can reason, plan, and interact with tools. In this project, we will explore both how language models are trained and how they can be used to build agentic workflows. We will start by studying a minimal implementation of a small language model and learning the basics of transformers and language model training. We will then explore how language models can be integrated into systems that perform multi-step reasoning and call external tools. The project will be hands-on and based on two open-source resources: minimind and hello-agents. The goal is to gain a practical understanding of modern LLM systems and how they can be used to build agents.
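The agentic loop described above can be caricatured in a few lines. Everything here is an assumption for illustration: the "model" is a hard-coded stub standing in for an LLM, and the CALL/FINAL protocol and tool names are invented for the sketch, not the interface of minimind or hello-agents.

```python
import re

# Available tools the agent may call (toy registry, an assumption).
TOOLS = {"add": lambda a, b: a + b}

def stub_model(history):
    # Stand-in for a real LLM: request the tool once, then answer
    # using the tool's result from the conversation history.
    if not any("RESULT" in h for h in history):
        return "CALL add(2, 3)"
    result = history[-1].split()[-1]
    return f"FINAL the sum is {result}"

def agent(question, max_steps=5):
    history = [question]
    for _ in range(max_steps):
        out = stub_model(history)
        m = re.match(r"CALL (\w+)\((.*)\)", out)
        if m:   # the model asked for a tool: dispatch and record it
            name, args = m.group(1), m.group(2)
            result = TOOLS[name](*[int(x) for x in args.split(",")])
            history.append(f"RESULT {result}")
        else:   # the model produced a final answer
            return out.removeprefix("FINAL ").strip()
    return "step limit reached"

print(agent("What is 2 + 3?"))  # the sum is 5
```

Real agent frameworks replace the stub with an actual model and the regex with structured tool-calling, but the control flow (generate, dispatch, append result, repeat) is the same loop the project will study.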