Autumn 2023 Projects


Andrea Boskovic: Introduction to Spatial Survey Sampling

Student: Andrew Sousa
Slides | Writeup
Prerequisites: STAT 311 or equivalent and some coding knowledge

In this DRP, we will investigate spatial survey sampling methods. We will start by reviewing some survey sampling techniques and then discuss how we can do statistical inference based on our surveys. Generally, the two paths for doing so are design-based methods and model-based methods. We will talk about when certain methods are appropriate and the benefits and drawbacks of each.



Antonio Olivas: Statistical evaluation of medical tests for classification and prediction

Student: Yuning Hu
Slides | Writeup
Prerequisites: None

In medicine, there are many tests for diagnosing a disease or for learning about an individual’s prognosis once a diagnosis has been established. But how do we know how accurately those tests diagnose the disease they are meant to detect? When more than one diagnostic test exists for the same disease, how do we know which one is better? And when the test result is a continuous variable, how do we choose the threshold that separates having the disease from not having it?

In this project we will learn how to evaluate the performance of continuous medical tests using the receiver operating characteristic (ROC) curve. The ROC curve is very popular in medicine because it conveys the performance of the test graphically. Using properties of the ROC curve, we will learn different ways of comparing two or more medical tests, and different ways of choosing the optimal threshold based on the condition of interest.

If time permits and the student is interested, we can also develop a score for screening or prognosis and assess its performance using the ROC curve.
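As a rough illustration of the ideas in this project, the sketch below builds an empirical ROC curve, computes the area under it, and picks a threshold by maximizing Youden's J (sensitivity plus specificity minus one). The data, the function names, and the choice of Youden's J as the threshold rule are all assumptions made for the example; the project itself may use other criteria.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) triples; a case is called positive when score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the empirical ROC curve (trapezoid rule, starting at (0, 0))."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, fpr, tpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area


def youden_threshold(points):
    """Threshold maximizing Youden's J = TPR - FPR (one of several possible rules)."""
    return max(points, key=lambda p: p[2] - p[1])[0]


# Hypothetical test results: higher score suggests disease; label 1 = diseased.
scores = [0.2, 0.3, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1]

points = roc_points(scores, labels)
print("AUC:", auc(points))
print("Youden threshold:", youden_threshold(points))
```

Two tests could be compared by computing the AUC of each on the same patients, although a formal comparison also needs a measure of uncertainty.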



Ellen Graham: Introduction to Longitudinal Data Analysis

Student: Zihang Wang
Slides | Writeup
Prerequisites: Familiarity with linear regression (for example, via STAT 311) and with R

In medicine, public health, and many other disciplines, we are often interested in understanding how participants change over time. For example, does a new drug slow the decline in lung function of people with lung disease? Are cognitive trajectories different for people with different types of dementia?

To answer these questions we require data collected over time, known as longitudinal data. Traditional regression methods, which assume independent observations, fail in this context because repeated observations of a single participant are correlated. In this project, we’ll learn about the challenges that arise from longitudinal data and how regression approaches such as linear mixed effects models and generalized estimating equations adapt to this setting. We may cover other topics depending on time and student interest.



Erin Lipman: Bayesian perspectives on probability and statistics: Part 2

Student: Leila Peitsch
Slides | Writeup
Prerequisites: This is a continuation of a DRP project from Spring 2023, and is not open for new applications

The simplest Bayesian models are those that use a conjugate prior, meaning that the posterior distribution can be computed directly (i.e., we can write down its probability density function). In most real-world analyses, however, especially those with more than one parameter, this is not the case. In general, the posterior distribution has to be approximated in another way, usually using a method called Markov chain Monte Carlo (MCMC). In this continuation of a Spring 2023 DRP, we will learn about and implement methods for sampling from the posterior distribution of a Bayesian model.
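To make the contrast concrete, here is a minimal sketch that assumes a Beta(1, 1) prior on a success probability and hypothetical data of 7 successes and 3 failures. Because the Beta prior is conjugate to the binomial likelihood, the posterior is Beta(8, 4) and its mean is known exactly; a simple random-walk Metropolis sampler (one MCMC method among many) should approximately recover it. The function names, step size, and burn-in length are illustrative choices, not part of the project.

```python
import math
import random


def log_posterior(theta, successes, failures, a=1.0, b=1.0):
    """Unnormalized log posterior for a Beta(a, b) prior and binomial data."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (a + successes - 1) * math.log(theta) + (b + failures - 1) * math.log(1 - theta)


def metropolis(successes, failures, n_draws=20000, step=0.2, seed=0):
    """Random-walk Metropolis sampler with a Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, failures)
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, failures)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(theta)
    return draws


successes, failures = 7, 3                                # hypothetical data
exact_mean = (1 + successes) / (2 + successes + failures)  # mean of Beta(8, 4)

draws = metropolis(successes, failures)
kept = draws[1000:]                                       # discard burn-in
mcmc_mean = sum(kept) / len(kept)
print("exact:", exact_mean, "MCMC:", mcmc_mean)
```

In a conjugate model like this one the sampler is unnecessary; its value is that the same recipe still works when no closed-form posterior exists.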



Ethan Ancell: Information Theory

Student: Abigail Cummings
Slides | Writeup
Prerequisites: A background in probability at the level of STAT 394, and a little bit of coding experience.

This directed reading project will be an introduction to information theory: the mathematical study of the communication of information. We will explore topics such as entropy, mutual information, channel capacity, and relative entropy (also known as KL divergence). We will also explore the impact of these ideas in statistics and related fields.
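For discrete distributions, several of these quantities take only a few lines to compute. The sketch below uses base-2 logarithms, so results are in bits, and it assumes that `q` is positive wherever `p` is; the function names and example distributions are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X; Y) as the KL divergence between the joint and the product of its marginals."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    flat_joint = [p for r in joint for p in r]
    flat_product = [ri * cj for ri in row for cj in col]
    return kl_divergence(flat_joint, flat_product)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y always equals X

print(entropy(fair_coin))                    # one bit of uncertainty
print(kl_divergence(biased_coin, fair_coin))
print(mutual_information(independent), mutual_information(identical))
```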



Nina Galanter: An Overview of Survival Analysis

Student: Daoming Liu
Slides | Writeup
Prerequisites: Some knowledge of R or another programming language, understanding of expected value and conditional probability, some familiarity with linear regression

In medicine and public health, we are often interested in answering questions about the time until an event occurs. For example, what is the median recovery time from some surgery? Or: does a treatment prolong the time until death for patients with a particular cancer? Because of this, survival analysis, which works with these time-to-event outcomes, is an important area of biostatistics. Most time-to-event data are censored: we cannot observe the event for everyone because we lose track of some subjects or something else happens to them. In this project, we will learn about survival analysis methods for censored data, including Kaplan-Meier curves, the log-rank test, and Cox regression. We may cover other topics based on time and student interest. This project will culminate in either a real data analysis using a dataset of the student’s choice or a simulation study.
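As a small illustration of how censoring enters the calculation, the sketch below computes the Kaplan-Meier estimate by hand. At each observed event time the survival estimate is multiplied by one minus the fraction of at-risk subjects who had the event; censored subjects leave the risk set without contributing an event. The data and function name are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival estimate)] at each event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censorings at t
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (months) and event indicators.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

The subject censored at month 3 still counts in the risk set at month 3 but not at month 5, which is why the last step drops by one half rather than one third.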



Nolan Cole: An Introduction to Targeted Maximum Likelihood Estimation

Student: Xiqian Yuan
Slides | Writeup
Prerequisites: Exposure to advanced probability/statistical theory. Bonus points for any experience with R!

Did you ever think in your intro stats class that there is no way that most real-world data follow a standard normal or Poisson distribution? You would be right!

Targeted Maximum Likelihood Estimation (TMLE) is a semiparametric statistical framework for using arbitrary machine learning algorithms to estimate and perform statistical inference on parameters of interest, all with minimal assumptions. This project will serve as an introduction to this exciting area. Depending on the student’s background and interests, we will read papers to obtain a broad understanding of how TMLE works and (time permitting) apply TMLE to real data.



Ronan Perry: Network statistics

Student: Guanhua Chen
Slides | Writeup
Prerequisites: Familiarity with the basics of probability and comfort programming in Python.

A guided reading of http://docs.neurodata.io/graph-stats-book/coverpage.html. The student will read chapters, discuss the contents with the instructor, and re-implement the Python examples. Depending on time and interest, the student can visualize and analyze a network dataset of their own.



Ronan Perry: The basics of difference-in-differences analyses

Prerequisites: Familiarity with the basics of probability and statistics. Basic understanding of R and ideally knowledge of regression.

Difference-in-differences is a popular social science method for estimating the effect of policy decisions from observational (non-experimental) data. We will read and discuss a subset of the contents of https://diff.healthpolicydatascience.org/#introduction, and implement examples in R from https://bookdown.org/mike/data_analysis/difference-in-differences.html.
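The core calculation in the simplest two-group, two-period setting can be sketched in a few lines. The example below is written in Python rather than R to keep it free of dependencies, and the outcome values and function names are invented for illustration. It relies on the usual parallel-trends assumption: without the policy, the treated group would have changed by the same amount as the control group.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(Change in treated group) minus (change in control group)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcomes before and after a policy change.
treated_pre = [10.0, 12.0, 11.0]
treated_post = [15.0, 17.0, 16.0]
control_pre = [9.0, 10.0, 11.0]
control_post = [11.0, 12.0, 13.0]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print("Estimated effect:", effect)
```

In practice the same estimate is usually obtained from a regression with a group-by-period interaction term, which also provides standard errors.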



Vydhourie Thiyageswaran: Electrical Resistance and Graphs

Student: Samuel Hsu
Slides | Writeup
Prerequisites: Strong background in linear algebra

We will study information flow on networks by studying resistance in electrical networks. We will also look at research papers and their approach to maximizing information flow in this setting.



Zhaoxing Wu: Classify High-Dimensional Data

Student: Elvin Liu
Slides | Writeup
Prerequisites: Some programming experience with R

In machine learning, classification is a task that assigns a class label to examples from the problem domain. However, high dimensionality poses significant statistical challenges and renders many traditional classification algorithms impractical to use. In this project, we will first learn or review (depending on the student’s background) some classical supervised classification techniques and discuss the curse of dimensionality. Next, we will mainly explore Penalized Discriminant Analysis (PDA), an extension of classical Linear Discriminant Analysis designed to classify high-dimensional data. It classifies data by finding the optimal lower-dimensional projections that reveal “interesting structures” in the original dataset. If time permits, the student will implement PDA to analyze a real-life dataset of their choice or some simple toy examples.