Autumn 2020 Projects

Anna Neufeld: Infectious Disease Modeling

Student: Harper Zhu

Slides | Writeup | External Link

Prerequisites: Some comfort in R; experience with calculus and differential equations will be useful but not required.

We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
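As a rough illustration of the deterministic SIR model described above, the sketch below integrates the standard equations dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I with a simple forward Euler step. It is written in Python rather than R, and the function name, parameter values, and population size are illustrative assumptions, not estimates from any real outbreak or material from the project.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with forward Euler; returns one row per day."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, constant in this closed model
    steps_per_day = int(round(1 / dt))
    history = [(0, s, i, r)]
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((day, s, i, r))
    return history


# Hypothetical parameters: transmission rate 0.3/day, recovery rate 0.1/day.
history = simulate_sir(beta=0.3, gamma=0.1, s0=999, i0=1, r0=0, days=160)
peak_day, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"Infections peak on day {peak_day} with about {peak_infected:.0f} people infected")
```

In practice beta and gamma are unknown, which is where the statistical estimation questions mentioned above come in.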

Bryan Martin: Ethics in Data Science and Statistics

Student: Jinghua Sun

Slides | Writeup

Prerequisites: None

In this project, we will discuss ethical questions and issues that arise in the field of statistics and data science. We will read case studies and work together to develop a lesson that can be taught to introductory statistics students as part of an undergraduate curriculum. By the end of this project, I hope to have material that I will use in my own courses! Topics will be driven by the student’s particular interest, but possible topics include: the history of statistics and eugenics, race and gender in data science, algorithmic fairness, reproducibility and open science, data transparency, privacy, and human subjects data.

Michael Pearce: History and Practice of Data Communication

Student: Ziyi Li

Slides | Writeup

Prerequisites: None; some experience with R or Python may be helpful but is not required.

In this course, we’ll learn about the development of data communication techniques and their modern use. We’ll begin by studying how people have visualized patterns in data over time, and consider how those methods reflected the computational resources available in each era. Then, we’ll shift our attention to modern issues in data communication, drawing examples from the COVID-19 pandemic and 2020 US presidential election: How do practitioners effectively show complex relationships or model uncertainty? How do people mislead readers through text and figures (intentionally or otherwise)? What common pitfalls exist, and how can we avoid them? We’ll finish with a data communication project based on the student’s interests.

Peter Gao: Statistics for Data Journalism: Election Forecasting

Student: Andy Qin

Slides | Writeup

Prerequisites: Experience with introductory stats (at the level of any of the intro classes) would help.

In this project, we’ll take a look at how leading newspapers and researchers conduct polls, forecast elections, and calculate polling averages. If there is interest, we can work on reverse engineering some of the methods used by publications such as FiveThirtyEight, RealClearPolitics, and the Upshot. Finally, we’ll consider the ethics of forecasting elections and using statistics in general to study our election process.

Ronak Mehta: The Magical Properties of the SVD

Student: Claire Gao

Slides | Writeup

Prerequisites: Linear Algebra (Math 308 or equivalent). Some statistical background, preferably at the level of 340.

The singular value decomposition (SVD) of a matrix has wide relevance in virtually all areas of applied mathematics. This project will consist of three parts:

a theory section containing proofs of intriguing properties of the SVD and derivations of three problems of wide importance in statistics and machine learning: principal component analysis (PCA), partial least squares (PLS), and canonical correlation analysis (CCA), all of whose solutions depend heavily on the SVD.
a simulation section demonstrating the bias-variance tradeoff of using one method over another for various regression/classification tasks.
a real data section in which the student interprets the features learned by these methods on a dataset of their choice.

This project is ideal for intermediate statistics students who want to make their linear algebra skills airtight and have a strong mathematical foundation for future success in machine learning.
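To make the link between PCA and the SVD concrete, here is a short worked derivation. The notation is an assumption chosen for illustration: X is an n-by-p data matrix whose columns have been centered, X = U Σ V^T is its thin SVD with Σ = diag(σ_1, ..., σ_r), and S is the sample covariance matrix.

```latex
\begin{aligned}
X &= U \Sigma V^{\top}, \qquad U^{\top} U = I, \quad V^{\top} V = I, \\
S &= \frac{1}{n-1} X^{\top} X
   = \frac{1}{n-1} V \Sigma U^{\top} U \Sigma V^{\top}
   = V \left( \frac{\Sigma^{2}}{n-1} \right) V^{\top}, \\
X V &= U \Sigma .
\end{aligned}
```

The second line is an eigendecomposition of S, so the columns of V are the principal directions and the variance along the i-th direction is σ_i^2 / (n - 1). The third line shows that the principal component scores are the columns of U Σ.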

Sarah Teichman: Phylogenetic Trees

Student: Lexi Xia

Slides | Writeup

Prerequisites: An intro stats class. Some R experience is useful but not required.

In this project, we will learn about the application of statistics to evolutionary biology through working with phylogenetic trees. In evolutionary biology, a diagram in the form of a tree is often used to represent the diversification of species over time. We’ll read chapters from the book Tree Thinking and choose a dataset to investigate deeply in R in order to understand phylogenies: what they are, how they are used, and how we can use statistics to construct them and test hypotheses about evolution.

Seth Temple: Statistical Genetics and Identity by Descent

Student: Rachel Ferina

Slides | Writeup

Prerequisites: None beyond a keen interest in the biological sciences.

We will explore many classical ways in which statistics has been employed to study heredity in humans and other organisms. For example, we will introduce the expectation-maximization algorithm to infer allele frequencies for ABO blood types and discuss Jacquard’s 9 condensed states of identity by descent. This tutorial will be very practical as we will draw many family trees to compute kinship and inbreeding coefficients. We will use UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees” as reading material. Depending on student interest, we may read more chapters from Thompson’s book, investigate the history of statistical genetics as it relates to the eugenics movement, or code up some computations like the path counting formula.
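The ABO example mentioned above can be sketched in a few lines. The code below is a minimal illustration, assuming Hardy-Weinberg proportions and allele frequencies p (A), q (B), and r (O). The E-step splits the observed A and B phenotype counts into expected homozygous and heterozygous genotype counts, and the M-step recounts alleles. The function name and the phenotype counts are hypothetical, constructed to match p = 0.3, q = 0.1, r = 0.6 exactly; they are not real data.

```python
def em_abo(n_a, n_b, n_ab, n_o, iterations=200):
    """Estimate ABO allele frequencies (p, q, r) from phenotype counts via EM."""
    n = n_a + n_b + n_ab + n_o
    p = q = r = 1 / 3  # arbitrary starting values
    for _ in range(iterations):
        # E-step: expected genotype counts given the current frequencies.
        n_aa = n_a * p * p / (p * p + 2 * p * r)
        n_ao = n_a - n_aa
        n_bb = n_b * q * q / (q * q + 2 * q * r)
        n_bo = n_b - n_bb
        # M-step: count alleles among the 2n gene copies.
        p = (2 * n_aa + n_ao + n_ab) / (2 * n)
        q = (2 * n_bb + n_bo + n_ab) / (2 * n)
        r = (n_ao + n_bo + 2 * n_o) / (2 * n)
    return p, q, r


# Hypothetical counts for 1000 people (A, B, AB, O phenotypes).
p_hat, q_hat, r_hat = em_abo(n_a=450, n_b=130, n_ab=60, n_o=360)
print(round(p_hat, 3), round(q_hat, 3), round(r_hat, 3))
```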

Shane Lubold: Random Network Models

Student: Peter Liu

Slides | Writeup

Prerequisites: Intro statistics and some programming experience (R or Python).

Network data, which consists of edges or relationships between nodes, is an important type of data. Many statistical models have been proposed to understand and model this type of data. Some are simple models which assume that all actors form connections with the same probability, while others are more complicated and use node-specific characteristics to determine the probability of an edge. In this project we will first review common network models (such as the Erdős-Rényi model, stochastic block model, and latent space model) and discuss why they might be useful in practice. We will then fit these models to data sets from the Stanford Network Analysis Project and discuss why some models fit better than others. The goal of the project is to understand how network data can arise in the real world and how network properties determine which models are reasonable. If we have time, we can also look at dynamic networks (networks that change over time) and see if we can model them using any of the models discussed above.
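The simplest model mentioned above, in which every pair of nodes connects with the same probability, can be simulated and fit in a few lines. This is a minimal sketch using only the Python standard library. The function names, network size, and edge probability are illustrative assumptions, and the data is simulated rather than taken from the Stanford Network Analysis Project.

```python
import random
from itertools import combinations


def sample_erdos_renyi(n, p, rng):
    """Draw an undirected graph on n nodes; each pair is an edge with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def estimate_edge_probability(n, edges):
    """Maximum likelihood estimate of p: observed edges divided by possible edges."""
    possible_edges = n * (n - 1) / 2
    return len(edges) / possible_edges


rng = random.Random(0)  # fixed seed so the run is reproducible
edges = sample_erdos_renyi(n=200, p=0.05, rng=rng)
p_hat = estimate_edge_probability(200, edges)
print(f"{len(edges)} edges, estimated edge probability {p_hat:.4f}")
```

Because this model has a single parameter, it cannot reproduce features such as community structure, which is one reason block and latent space models are also considered.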

Subodh Selukar: Introduction to Survival Analysis

Student: Howard Baek

Slides | Writeup | External Link

Prerequisites: Familiarity with R; familiarity with survival analysis

In many applications, researchers are interested in the time it takes for an outcome of interest to occur: for example, time to death by any cause (“overall survival”) is the gold standard outcome for studies in many biomedical fields. Among other characteristics, these data exhibit a special kind of missingness termed “censoring,” which requires the use of different statistical methods than other data types. In this project, the student will learn about the characteristics of time-to-event (or “survival”) data and basic methods for approaching these data.
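One basic method for censored time-to-event data is the Kaplan-Meier estimator of the survival function. The sketch below is a minimal pure-Python version offered for illustration; the source does not say which methods the project covers, and the function name and the tiny dataset are hypothetical. An event flag of 1 means the outcome was observed at that time, and 0 means the subject was censored then.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where at least one event occurs."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    idx = 0
    while idx < len(data):
        t = data[idx][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing this time; censored subjects at time t
        # still count as at risk at t.
        while idx < len(data) and data[idx][0] == t:
            deaths += data[idx][1]
            removed += 1
            idx += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up times (say, in months) and event indicators.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects never cause a drop in the curve, but they shrink the risk set for later event times, which is how the estimator uses their partial information.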

Zhaoqi Li: Statistical Illusions

Student: Yeji Sohn

Slides | Writeup

Prerequisites: Motivation to think about interesting problems and a readiness to have your brain teased. Some mathematical maturity would be beneficial.

Did you know that there is a “statistically significant” relationship between your salary and whether you pee at night? Did you know that you will always wait longer than others at a bus stop? Did you know that a lot of the statistical concepts you learned in class actually don’t make sense? In this quarter-long study, we will dive into some common misconceptions about statistics and into questions of how to interpret statistics. We will touch on a wide range of statistical topics through the lens of paradoxes and learn the intuition behind them.

No prior knowledge of statistics is required, but motivation is encouraged.
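One common reading of the bus-stop claim is the inspection paradox: a rider who shows up at a random moment is more likely to land in a long gap between buses than in a short one. The simulation below is a sketch under that interpretation, which is an assumption rather than something the description states. It also assumes gaps between buses are exponentially distributed with a mean of 10 minutes; the function name and all numbers are hypothetical.

```python
import bisect
import random


def simulate_bus_gaps(mean_gap, n_buses, n_riders, rng):
    """Compare the average gap between buses with the average gap riders land in."""
    arrivals = []
    t = 0.0
    for _ in range(n_buses):
        t += rng.expovariate(1 / mean_gap)
        arrivals.append(t)
    # gaps[k] is the interval that ends when bus k arrives.
    gaps = [arrivals[0]] + [b - a for a, b in zip(arrivals, arrivals[1:])]
    average_gap = sum(gaps) / len(gaps)

    experienced = []
    for _ in range(n_riders):
        rider_time = rng.uniform(0, arrivals[-1])
        k = bisect.bisect_left(arrivals, rider_time)  # next bus after the rider arrives
        experienced.append(gaps[k])
    average_experienced_gap = sum(experienced) / len(experienced)
    return average_gap, average_experienced_gap


rng = random.Random(1)  # fixed seed so the run is reproducible
average_gap, average_experienced_gap = simulate_bus_gaps(10.0, 20000, 20000, rng)
print(f"average gap between buses: {average_gap:.1f} minutes")
print(f"average gap a rider lands in: {average_experienced_gap:.1f} minutes")
```

Under these assumptions the gap a rider lands in averages about twice the typical gap, because long gaps cover more of the timeline and are therefore sampled more often.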