Spring 2021 Projects


Alan Min and Anupreet Porwal: Expectations and Sampling methods

Student: Kai Gong
Slides | Writeup
Student: Aubrey Yan
Slides | Writeup
Prerequisites: STAT 340-341 and some knowledge of R; basics of expectations and probability distributions.

The expectation of a random variable, or of a function of random variables, can be difficult to compute analytically when the variables do not follow standard, well-known distributions. One way to approximate expectations is by “intelligently” drawing samples from the probability distributions. In this project, we will cover several sampling and Monte Carlo methods to draw samples from “difficult distributions” and use these samples to approximate expectations. In particular, we will look at transformation-based sampling, importance sampling, rejection sampling, and their popular variants. Finally, we will compare the performance of these sampling methods and apply the methodology, as part of a Bayesian analysis, to a dataset of interest to the student. We are happy to take two students if more than one person is interested in this project.
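As a rough illustration of the three sampling ideas named above, here is a minimal sketch in Python (used only for illustration; the project itself works in R). The target density p(x) = 2x on [0, 1], the uniform proposal, and all function names are assumptions chosen for the example, not material from the project. Each estimator approximates E[X], whose exact value is 2/3.

```python
import math
import random


def target_pdf(x):
    """Illustrative target density p(x) = 2x on [0, 1]; exact E[X] = 2/3."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def inverse_transform_mean(n, rng):
    # Transformation-based sampling: the CDF is F(x) = x^2, so X = sqrt(U).
    return sum(math.sqrt(rng.random()) for _ in range(n)) / n


def rejection_mean(n, rng, envelope=2.0):
    # Rejection sampling: propose from Uniform(0, 1) and accept a draw x
    # with probability p(x) / (envelope * q(x)), where q(x) = 1.
    accepted = []
    while len(accepted) < n:
        x = rng.random()
        if rng.random() <= target_pdf(x) / envelope:
            accepted.append(x)
    return sum(accepted) / n


def importance_mean(n, rng):
    # Importance sampling: draw from Uniform(0, 1) and reweight each draw
    # by w(x) = p(x) / q(x) = p(x), since q(x) = 1 on [0, 1].
    total = 0.0
    for _ in range(n):
        x = rng.random()
        total += target_pdf(x) * x
    return total / n


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 100_000
estimates = {
    "inverse transform": inverse_transform_mean(n, rng),
    "rejection": rejection_mean(n, rng),
    "importance": importance_mean(n, rng),
}
for name, value in estimates.items():
    print(f"{name:>18}: {value:.4f}")
```

All three estimates should land close to 2/3, but they differ in cost and variance: rejection sampling discards about half of its proposals here, while importance sampling keeps every draw at the price of weighting it.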



Alex Ziyu Jiang: Sampling methods, Markov Chain Monte Carlo and Cryptography

Student: Kathleen Cayha
Slides | Writeup
Prerequisites: Basic knowledge of probability is recommended (STAT 311 level). Some prior coding experience with R would be helpful but is not necessary.

In this project, we will learn how to decipher coded messages with the widely used Markov chain Monte Carlo method. We will first go through the basics of Markov chain models after a quick probability warm-up. After that, we will learn how to generate samples from a known distribution using a wide range of techniques. Finally, we will apply our tools to a dataset consisting of coded messages and see how the ‘messy’ code gradually turns into complete sentences as the algorithm iterates.



Anna Neufeld: Infectious disease modeling

Student: Kayla Kenyon
Slides | Writeup
Prerequisites: Some comfort in R; experience with calculus and differential equations will be useful but not required.

We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
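To make the idea of a deterministic compartmental model concrete, here is a minimal sketch of the standard SIR equations, integrated with a simple forward Euler step. It is written in Python for illustration only (the project itself works in R), and the function name, the parameter values, and the step size are assumptions chosen for the example rather than anything taken from the project.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I,
    dR/dt = gamma*I with forward Euler steps of size dt (in days)."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, constant in this model
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history


# Hypothetical inputs: transmission rate 0.3/day, recovery rate 0.1/day,
# so the basic reproduction number is beta / gamma = 3.
history = simulate_sir(beta=0.3, gamma=0.1, s0=999, i0=1, r0=0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"peak of about {peak_infected:.0f} infected near day {peak_time:.0f}")
```

The output depends entirely on the inputs beta and gamma, which is why estimating those parameters from outbreak data, with an honest account of uncertainty, is the statistical part of the problem.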



Apara Venkat: Networks and Choice Modeling

Student: Andrey Risukhin
Slides | Writeup
Prerequisites: Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary. A general interest and curiosity about math and the world.

Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the “utility” of an item? How do you model the decisions of a customer when there are random effects? If we have data from decisions made by customers, can we identify the utility of various cereals? Discrete choice models attempt to answer these questions. In the language of networks, this problem is closely related to ranking. How does Google rank the webpages? How do we rank players in sports tournaments? Recent developments have been unifying these fields.
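One common way to turn utilities plus random effects into choice probabilities is the multinomial logit model. The sketch below assumes that model (the description above does not say which discrete choice models the project covers), and the cereal names and utility values are hypothetical.

```python
import math


def choice_probabilities(utilities):
    """Multinomial logit: P(item) is proportional to exp(utility).

    Subtracting the maximum utility first keeps exp() numerically stable
    without changing the resulting probabilities.
    """
    top = max(utilities.values())
    weights = {item: math.exp(u - top) for item, u in utilities.items()}
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}


# Hypothetical utilities for three cereals.
cereal_utilities = {"oat rings": 1.0, "corn flakes": 0.5, "granola": 0.0}
probabilities = choice_probabilities(cereal_utilities)
for item, p in probabilities.items():
    print(f"{item:>12}: {p:.3f}")
```

Under this model only differences in utility matter, so adding the same constant to every cereal's utility leaves the probabilities unchanged. That is one reason identifying utilities from observed choices requires fixing a reference level.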

In this project, we will wrestle with these questions. First, we will learn about discrete choice models. Then, we will learn about ranking in networks. Finally, we will attempt to reconcile the two. Time permitting, we will also run simulations and look at datasets along the way.



Michael Pearce: Nonlinear Regression

Student: Muhammad Anas
Slides | Writeup
Prerequisites: A basic knowledge of linear regression and some experience in R.

Simple linear regression models can be easy to implement and interpret, but they don’t always fit data well! For this project, we’ll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we’ll even see how to validate such models using cross-validation! We will mostly use James et al.’s “An Introduction to Statistical Learning” Chapter 7.
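As a small illustration of relaxing linearity, the sketch below fits a straight line and a step function (one of the methods listed above) to simulated data from a sine curve and compares their training error. It is written in Python for illustration only (the project itself works in R); the simulated data, the cutpoints, and the function names are all assumptions made for the example.

```python
import bisect
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_step_function(xs, ys, cutpoints):
    """Piecewise-constant fit: predict the mean response within each bin."""
    sums = [0.0] * (len(cutpoints) + 1)
    counts = [0] * (len(cutpoints) + 1)
    for x, y in zip(xs, ys):
        b = bisect.bisect_right(cutpoints, x)  # index of the bin holding x
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for bins with no observations
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bisect.bisect_right(cutpoints, x)]


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(1)  # fixed seed so the run is reproducible
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]

line = fit_line(xs, ys)
steps = fit_step_function(xs, ys, [k * math.pi / 4 for k in range(1, 8)])
line_mse = mean_squared_error(line, xs, ys)
step_mse = mean_squared_error(steps, xs, ys)
print(f"linear fit MSE: {line_mse:.3f}   step function MSE: {step_mse:.3f}")
```

These are training errors, which keep falling as more bins are added. An honest comparison of model complexity would use held-out data, for example through cross-validation.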



Peter Gao: Ethics of Algorithmic Decision Making

Student: Kevin Hoang
Slides | Writeup
Prerequisites: None

In this project, we’ll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We’ll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we’ll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there’s interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.



Sarah Teichman: Ethics of Algorithmic Decision Making

Student: Liwen Peng
Slides | Writeup
Prerequisites: None

In this project, we’ll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We’ll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we’ll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there’s interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.



Taylor Okonek: Disease Mapping

Student: Wuwei Zhang
Slides | Writeup
Prerequisites: STAT 340; interest in public health applications; familiarity with R.

Disease mapping is an important tool for visualizing spatial data on the prevalence and/or incidence of various diseases. In this project, we’ll discuss different types of spatial data, and explore visualization techniques and their usefulness in conveying relevant information. In particular, we’ll discuss ways to visualize uncertainty in disease mapping, how estimates underlying a disease map inform public health policy, issues with data sparsity and spatial aggregation, and how to obtain the estimates that underlie such maps. We’ll learn about Bayesian hierarchical models and, time permitting, spatial random effect terms. Throughout, we’ll explore concepts using real data from various diseases that are of student interest.