Winter 2021 Projects


Apara Venkat: Networks and Choice Modeling

Student: Xuling Yang
Slides | Writeup
Prerequisites: Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary. A general interest and curiosity about math and the world.

Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the “utility” of an item? How do you model the decisions of a customer when there are random effects? If we have data on decisions made by customers, can we identify the utilities of the various cereals? Discrete choice models attempt to answer these questions. In the language of networks, this problem is closely related to ranking. How does Google rank webpages? How do we rank players in sports tournaments? Recent developments have been unifying these fields.

In this project, we will wrestle with these questions. First, we will learn about discrete choice models. Then, we will learn about ranking in networks. Finally, we will attempt to reconcile the two. Time permitting, we will also run simulations and look at datasets along the way.
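
For a small taste of what a discrete choice model looks like, here is a hedged Python sketch (not part of the assigned readings; the cereal names and utility values are made up). In the multinomial logit model, the probability of choosing an item from an offered set is proportional to the exponential of its utility:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical "true" utilities for four cereals (made-up numbers).
    utilities = {"bran": 0.2, "granola": 1.0, "oats": 0.5, "sugar_bomb": 1.5}

    def choice_probabilities(offered):
        """Multinomial logit: P(pick item i) is proportional to exp(utility_i)."""
        u = np.array([utilities[item] for item in offered])
        p = np.exp(u - u.max())          # subtract the max for numerical stability
        return p / p.sum()

    def simulate_choice(offered):
        """Simulate one customer choosing from the offered set."""
        return rng.choice(offered, p=choice_probabilities(offered))

    offered = ["bran", "granola", "sugar_bomb"]
    print(dict(zip(offered, choice_probabilities(offered).round(3))))
    print(simulate_choice(offered))

Ranking models such as Bradley–Terry arise as the special case where customers choose between pairs of items, which is one way the choice-modeling and ranking literatures connect.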



Bryan Martin: Statistical Learning with Sparsity

Student: Jerry Su
Slides | Writeup
Prerequisites: Familiarity with regression, up to a STAT 311 level

Many modern applications benefit from the principle that less is more. Whether due to practical computational concerns from big data, overfitting concerns from too many parameters, or estimability concerns from a small sample size, statistical models often require sparsity. Sparsity can improve our predictions, help make the patterns we observe in our data more reproducible, and give our model parameters desirable properties.

Often, sparsity is imposed through penalization, where we include a penalty term in our objective that encourages some parameter estimates to be exactly zero. We will learn about some of the statistical theory underlying how penalization works and how it impacts our model output, both mathematically and computationally. We will also learn about and compare different sparsity schemes, such as the lasso, the group lasso, the elastic net, and more. We will focus on understanding the different settings in which we might be interested in different forms of sparsity, and we will apply these tools to real data.
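
For a concrete, hedged illustration (a minimal scikit-learn sketch on simulated data, not the project's prescribed software; the penalty strength alpha = 0.1 is an arbitrary choice), the lasso adds an L1 penalty to the least-squares objective and, as a result, estimates many coefficients as exactly zero:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p = 100, 20
    X = rng.normal(size=(n, p))

    # Only the first three coefficients are truly nonzero (a sparse truth).
    beta = np.zeros(p)
    beta[:3] = [3.0, -2.0, 1.5]
    y = X @ beta + rng.normal(scale=0.5, size=n)

    # The lasso minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1.
    fit = Lasso(alpha=0.1).fit(X, y)
    print("estimated nonzero coefficients:", np.flatnonzero(fit.coef_))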



Eric Morenz and Yiqun Chen: See what's not there

Student: Suh Young Choi
Slides | Writeup
Prerequisites: Experience with linear regression, probability, or data manipulation will allow a deeper dive into the content, but it is not required for students who are interested in the subject.

In this project, we will take a look at the concept of identification in the context of missing data (and causal inference, if time permits or there is interest!). While no glamorous artificial intelligence buzzwords are involved in the project per se, remember that your model is only as good as your data (and, as we will see by the end of the quarter, only as good as your identification assumptions!). We will draw from various sources (e.g., Chapter 6 of Foundations of Agnostic Statistics), keeping the schedule and materials flexible given your background and interests. We will consider a few empirical problems as well, from TidyTuesday datasets to political polls.
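
To preview why identification assumptions matter, here is a hedged, hypothetical simulation (not drawn from the readings): when whether a value is reported depends on the value itself, the complete-case average stays biased no matter how much data we collect.

    import numpy as np

    rng = np.random.default_rng(2)
    income = rng.lognormal(mean=10.5, sigma=0.5, size=100_000)

    # Suppose higher earners are less likely to report their income
    # (missing not at random); this mechanism is usually unknown in practice.
    report_prob = 1 / (1 + np.exp((np.log(income) - 10.5) * 2))
    observed = income[rng.random(income.size) < report_prob]

    print(f"true mean:          {income.mean():,.0f}")
    print(f"complete-case mean: {observed.mean():,.0f}")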



Jerry Wei: Topological Data Analysis

Student: Joia Zhang
Slides | Writeup
Prerequisites: Exposure to probability theory and linear algebra

Topological Data Analysis (TDA) is broadly about data analysis methods that find structure in data. This encompasses many topics; we will focus on clustering and nonlinear dimension reduction. We will study some textbook chapters and some classical papers. We may also go into mode estimation and manifold estimation if there is interest.
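
As a hedged preview combining the two themes (a minimal scikit-learn sketch on a toy dataset; the dataset and tuning parameters are illustrative only), one can first apply nonlinear dimension reduction and then cluster in the reduced space:

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap
    from sklearn.cluster import KMeans

    # A toy 3-D dataset that lies near a curved 2-D surface.
    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    # Nonlinear dimension reduction: unroll the surface into 2 dimensions.
    X_low = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

    # Clustering in the reduced space.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)
    print(labels[:10])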



Kenny Zhang: Deep Learning for Computer Vision

Student: Angela Zhao
Slides | Writeup
Prerequisites: Proficiency in a programming language (preferably Python). Some exposure to basic probability and computer science would be helpful.

Image data is prevalent in our lives, and modern deep learning provides powerful tools for working with it. We will start with logistic regression and study what a neural network is. Then we will move on to convolutional neural networks and some coding exercises. If time allows, we can delve further into state-of-the-art Generative Adversarial Networks (GANs) and more complicated tasks like segmentation.
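
As a hedged sketch of this progression (illustrative layer sizes assuming 28x28 grayscale images, not a prescribed architecture), logistic regression is a single linear layer, and a convolutional network stacks convolution and pooling layers before a final linear layer:

    import torch
    from torch import nn

    # Logistic regression: a single linear layer producing 10 class scores.
    logistic = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

    # A small convolutional neural network for the same 28x28 grayscale inputs.
    cnn = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                       # 28x28 -> 14x14
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                       # 14x14 -> 7x7
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 10),             # 10 class scores
    )

    x = torch.randn(8, 1, 28, 28)              # a batch of 8 fake images
    print(logistic(x).shape, cnn(x).shape)     # both: torch.Size([8, 10])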



Michael Pearce: Nonlinear Regression

Student: Alejandro Gonzalez
Slides | Writeup
Prerequisites: A basic knowledge of linear regression and some experience in R

Simple linear regression models can be easy to implement and interpret, but they don’t always fit data well! For this project, we’ll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we’ll even see how to validate such models using cross-validation! We will mostly follow Chapter 7 of James et al.’s “An Introduction to Statistical Learning.”
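
As a hedged illustration of the idea (the project itself will likely work in R following the textbook; this minimal Python sketch uses simulated data and arbitrary polynomial degrees), polynomial regression relaxes linearity, and cross-validation helps choose how flexible the fit should be:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)   # clearly nonlinear

    # Compare polynomial degrees by 5-fold cross-validated R^2.
    for degree in [1, 3, 5, 9]:
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        score = cross_val_score(model, x, y, cv=5).mean()
        print(f"degree {degree}: CV R^2 = {score:.3f}")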



Peter Gao: Survey of Data Journalism

Student: Rohini Mettu
Slides | Writeup
Prerequisites: None

In this project, we’ll take a look at recent uses of data and statistics in journalism and discuss their effectiveness in applying statistical methods and communicating results to their readers. If desired, we can focus on a specific area of application (climate change, economics, sports, epidemiology). If there is interest, we can look into replicating and extending a particular example.



Richard Guo: Making probability rigorous

Student: Mark Lamin
Slides | Writeup
Prerequisites: Probability theory at the level of MATH/STAT 394 and 395

Having sat through an introductory probability course, you have likely heard terms like “Lebesgue measure”, “sigma algebra”, “almost sure convergence”, and even “martingale”. Do you wonder what they are and why they matter? This is a reading program that will introduce these notions and make the probability you have learned rigorous. We will read together the acclaimed monograph “Probability with Martingales” by David Williams. A rigorous treatment of probability and measure theory will prepare you for more advanced topics, such as stochastic processes, learning theory, and theoretical statistics.
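
For a small taste of the rigor involved, here are two standard definitions (written in LaTeX) that the monograph makes precise: almost sure convergence of a sequence of random variables, and the martingale property of an integrable process (M_n) adapted to a filtration (F_n):

    % Almost sure convergence of X_n to X:
    X_n \xrightarrow{\text{a.s.}} X
      \iff
      \mathbb{P}\left( \lim_{n \to \infty} X_n = X \right) = 1.

    % Martingale property: (M_n) is adapted and integrable, and
    \mathbb{E}\left[ M_{n+1} \mid \mathcal{F}_n \right] = M_n
      \quad \text{for all } n \ge 0.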



Sarah Teichman: Multivariate Data Analysis

Student: Lindsey Gao
Slides | Writeup
Prerequisites: STAT 311; linear algebra would be helpful but is not necessary

Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook “An Introduction to Applied Multivariate Analysis with R” to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.
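
As a hedged preview of one of these methods (the textbook works in R; this minimal Python sketch on simulated data is for illustration only), principal components analysis summarizes many correlated variables with a few directions of maximal variance:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(4)

    # Simulate 100 observations of 10 correlated variables driven by 2 factors.
    factors = rng.normal(size=(100, 2))
    loadings = rng.normal(size=(2, 10))
    X = factors @ loadings + rng.normal(scale=0.2, size=(100, 10))

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_.round(2))   # most variance in 2 components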



Seth Temple: Statistical Genetics and Identity by Descent

Student: Selma Chihab
Slides | Writeup
Prerequisites: STAT 311; some programming experience preferred

We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees, and (2) to estimate relatedness given dense SNP or whole-genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for one or two meetings we will go through brief hands-on labs using current research software.
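
As a hedged preview of one item on that list (a minimal sketch with made-up phenotype counts, not Thompson's notation), the classical gene-counting EM algorithm estimates the ABO allele frequencies p, q, r from observed blood-type counts, using Hardy-Weinberg genotype proportions in the E-step:

    # Observed blood-type (phenotype) counts: hypothetical numbers for illustration.
    n_A, n_B, n_AB, n_O = 725, 258, 72, 1073
    N = n_A + n_B + n_AB + n_O

    p, q, r = 1 / 3, 1 / 3, 1 / 3           # starting frequencies of alleles A, B, O
    for _ in range(50):
        # E-step: split phenotype A into genotypes AA and AO (likewise for B),
        # using Hardy-Weinberg proportions under the current frequencies.
        n_AA = n_A * p**2 / (p**2 + 2 * p * r)
        n_AO = n_A - n_AA
        n_BB = n_B * q**2 / (q**2 + 2 * q * r)
        n_BO = n_B - n_BB
        # M-step: gene counting over the 2N alleles in the sample.
        p = (2 * n_AA + n_AO + n_AB) / (2 * N)
        q = (2 * n_BB + n_BO + n_AB) / (2 * N)
        r = (2 * n_O + n_AO + n_BO) / (2 * N)

    print(round(p, 3), round(q, 3), round(r, 3))   # estimated frequencies of A, B, O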



Taylor Okonek: Topics in Biostatistics

Student: Anna Elias-Warren
Slides | Writeup
Prerequisites: Introductory statistics (a plus but not required); an interest in public health applications

In this project, we’ll first broadly discuss some of the main pillars of the field of biostatistics, and then focus on a more specific topic for the remainder of the quarter. The main pillars we’ll discuss include design of clinical trials, survival analysis, and infectious disease modeling. The focused part of this project can be greatly driven by student interest. Possible directions include: doing a deep-dive into the design of COVID-19 vaccine trials; reading articles about and discussing implications of communicating public health analyses to the public; gaining a broad understanding of how infectious disease models have influenced policy throughout the world; reading about ethical issues in global health studies; and more!
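
As a hedged taste of the infectious disease modeling pillar (a minimal sketch with made-up parameter values, not a model from the readings), the classic SIR compartmental model tracks the susceptible, infectious, and recovered fractions of a population over time:

    import numpy as np

    beta, gamma = 0.3, 0.1        # transmission and recovery rates (illustrative)
    S, I, R = 0.99, 0.01, 0.0     # initial fractions of the population
    dt, days = 0.1, 200

    # Simple Euler discretization of the SIR differential equations.
    for _ in np.arange(0, days, dt):
        new_infections = beta * S * I * dt
        new_recoveries = gamma * I * dt
        S -= new_infections
        I += new_infections - new_recoveries
        R += new_recoveries

    print(f"final share ever infected: {R + I:.2f}")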