Spring 2022

Andrea Boskovic and Harshil Desai: NBA Analytics and Machine Learning

  • Prerequisites: Some experience in R or Python; some knowledge about basketball
  • Have you ever wondered how to predict which NBA rookie will become an All-Star, or how teams choose which players to draft? In this project, we will explore NBA data to build a model that predicts something related to basketball. We will start with an introduction to basic machine learning models, learn how to implement models in R or Python, and evaluate the models we've created. Potential directions could include (but are definitely not limited to) ranking players based on box scores and advanced stats, predicting who will be the MVP, or predicting a team's odds of making the playoffs in a given year. We are willing to mentor two students!
  • Nina Galanter: Optimal Treatment Rules: Causal Inference and Statistical Learning

    Student: Max Bi
    Slides , Writeup

  • Prerequisites: Some familiarity with conditional probability, linear regression, and R
  • In many biomedical and public health applications of statistics we are interested in determining the best treatment. However, people and their specific situations will vary and in some cases one treatment does not fit all! Instead, we can create a treatment rule which will take in a subject and their variables and predict the best treatment for them. Optimal treatment rules involve both causal inference and statistical learning as we create rules based on estimated treatment effects. This project will first go over causal inference foundations and then explore Q-learning methods for treatment rules, which might include regression, penalized regression, or generalized additive models depending on time and the student's background. We will use R to evaluate the methods with simulated data.
  • Anna Neufeld and Alan Min: Introduction to Computational Biology

    Student: Wei Jun Tan
    Slides , Writeup
    Student: Iris Zhou
    Slides , Writeup

  • Prerequisites: Programming experience (preferably in R). Knowledge of probability distributions at the level of Math/Stat 394 or Stat 340 is preferred but not required.
  • Given massive amounts of data available from next generation genome sequencing, sequence alignment methods are necessary to align genomic reads to reference genomes. Alignment tools make it possible to identify genetic variation and mutation leading to biological discovery. We plan to work with the textbook "Computational Genome Analysis," by Deonier, Waterman, and Tavare (available for free online). We will start with some background reading on necessary biological context, and then we will read about statistical concepts related to sequence alignment problems that are common in modern computational biology. After gaining this necessary background, we will learn about modern algorithms for sequence alignment. We are hoping to mentor two students!
  • Reading and Research Opportunity on Voting

    Mentors: Prof. Elena Erosheva, Michael Pearce, Prof. Conor Mayo-Wilson

    Students: Minghe (Mia) Zhang and Man (Terry) Yuan
    Slides , Writeup
  • Prerequisites: Computational skills (R required; other knowledge and experience, e.g., with Python, is desirable). Preference given to Statistics and CSE majors and to candidates with the interest and ability to continue with the project in Summer and Fall 2022.
  • In peer review settings, groups or panels of experts are tasked with evaluating submissions such as grant proposals or job candidate materials. For each submission, individual input is often given as a numeric score or a letter grade. The average or median of such scores is often used to summarize the collective opinion of a panel of experts. In this project, we will consider other ways to aggregate expert opinions by drawing a parallel between panel decisions and elections or voting. All voting procedures have two key features: the types of input that are used and how these inputs are aggregated. Examples of voting procedures include majority rule, Borda rule, single transferable vote, and majority judgement. Voting procedures matter in that a choice of voting procedure can change panel outcomes or which candidate(s) or proposal(s) are preferred. Social choice theory demonstrates that (a) no voting procedure for selection of one out of three or more choices can simultaneously satisfy a small number of natural desiderata (this result is known as Arrow's Impossibility Theorem), that (b) every voting procedure satisfies some desiderata but not others, and that (c) election outcomes can differ depending on what voting system is used. The points (a)-(c) constitute compelling reasons in favor of better understanding the influence of aggregation methods on panel-level outcomes: we will critically assess properties of voting procedures and whether these properties should be required or desired in panel opinion aggregation methods used in peer review. The project will involve applying social choice algorithms (e.g., Borda rule and Majority Judgement) to de-identified data on panel grant peer review.
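Point (c) can be illustrated with a minimal Python sketch. The ballots below are invented for illustration and are not the project's peer review data; the function names are hypothetical. The sketch compares the Borda rule with plurality (most first-place votes), a simple stand-in for the majority-style rules mentioned above.

```python
from collections import Counter

# Hypothetical ranked ballots (most preferred first); not real review data.
ballots = (
    [["A", "B", "C"]] * 4
    + [["B", "C", "A"]] * 3
    + [["C", "B", "A"]] * 2
)

def plurality_winner(ballots):
    """Winner by first-place votes only."""
    firsts = Counter(b[0] for b in ballots)
    return max(firsts, key=firsts.get)

def borda_scores(ballots):
    """Borda rule: with m candidates, a ballot gives m-1 points to its
    first choice, m-2 to its second, and so on down to 0 for its last."""
    scores = Counter()
    for b in ballots:
        m = len(b)
        for position, candidate in enumerate(b):
            scores[candidate] += m - 1 - position
    return dict(scores)

def borda_winner(ballots):
    """Candidate with the highest total Borda score."""
    scores = borda_scores(ballots)
    return max(scores, key=scores.get)

print(plurality_winner(ballots))  # A (4 of 9 first-place votes)
print(borda_scores(ballots))      # A: 8, B: 12, C: 7
print(borda_winner(ballots))      # B
```

On these ballots the two rules disagree: A has the most first-place votes, while B, ranked first or second by every voter, has the highest Borda score.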
  • Antonio Olivas: Estimation for cancer screening models using deconvolution

  • Prerequisites: Calculus (MATH 126) and exposure to probability theory (STAT 340).
  • Cancer screening programs are an important component of secondary cancer prevention. To understand the conditions under which a cancer screening program provides the best benefit, mathematical models are used to estimate relevant quantities using information from cancer screening trials. In the natural history of a cancer, the time to cancer onset (subclinical) and the sojourn/latent time (time between onset and clinical appearance) are two quantities of interest, but they are impossible to observe separately. However, by using a screening tool we obtain some information that allows us to differentiate between these two components. In this project we will study a mathematical model that uses aggregate-level information from a cancer screening trial to estimate mean time to onset, mean sojourn time, and sensitivity of the screening test, via the deconvolution formula and maximum likelihood estimation.
  • Rrita Zejnullahi: Introduction to Human Rights Statistics

    Student: Cindy Elder
    Slides , Writeup

  • Prerequisites: Some exposure to survey sampling and regression analysis.
  • In this DRP project, we consider the application of statistical methodology to human rights. Topics include missing females, criminal justice, violence against women, hunger, and poverty. By the end of the project, we will be able to describe ways that statistical methods can be applied to human rights problems and identify areas that need development of new methods. In the first half, we will read and discuss research papers. In the latter half, we will pick a paper to replicate, with the exact choice of topic at the student's discretion. This project will be mostly remote (meetings via Zoom!).
Winter 2022

  • Medha Agarwal: Statistical Simulations

    Student: Evana Sorfina Mohd Nazri
    Slides , Writeup

  • Prerequisites: STAT 311, programming experience (preferably in R/Python)
  • This project aims to explore various methods of statistical simulation, covering both their theoretical underpinnings and their practical use. We will cover methods of obtaining independent and identically distributed random samples for both continuous and discrete random variables. These include inverse transform, accept-reject, ratio of uniforms, and importance sampling, among others. During the later parts of the project, we will delve into Markov chain Monte Carlo (MCMC), a robust method of obtaining correlated random samples from any probability distribution. While MCMC is a rich area in itself (reading about it is highly encouraged), we will cover the two most popular MCMC algorithms: Metropolis-Hastings and Gibbs sampling. Since simulation is a very programming-centric topic, the project will regularly involve coding the sampling methods covered. These will be short programs for toy examples and will not require advanced programming skills.
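As a taste of the first method listed above, here is a minimal sketch of inverse transform sampling for an exponential distribution, written in Python with only the standard library. The function name, rate, seed, and sample size are arbitrary choices made for illustration and are not part of the project materials.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse transform: if U ~ Uniform(0, 1), then
    -ln(1 - U) / rate follows an Exponential(rate) distribution,
    because the exponential CDF F(x) = 1 - exp(-rate * x) inverts to that."""
    u = rng.random()  # u lies in [0, 1), so 1 - u lies in (0, 1]
    return -math.log(1.0 - u) / rate

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)

# The theoretical mean is 1 / rate = 0.5; the sample mean should be close.
print(round(sample_mean, 3))
```

The same recipe works for any distribution whose CDF can be inverted in closed form; accept-reject and MCMC methods are used when it cannot.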
  • Michael Cunetta: Sabermetrics

    Student: David Wang
    Slides , Writeup

  • Prerequisites: Familiarity with the rules of major league baseball. Some familiarity with R.
  • We will read excerpts from "The Book: Playing the Percentages in Baseball" (2007) and carry out our own inference (in R) using baseball datasets. By the end of the project, we will understand core sabermetric principles, we will be critical consumers of baseball analysis, and we will be able to ask and answer our own baseball-related research questions. In April, the student and mentor will go on a field trip to T-Mobile Park to cheer on the Mariners.
  • Nina Galanter: Optimal Treatment Rules: Causal Inference and Statistical Learning

    Student: Leah Jia
    Slides , Writeup

  • Prerequisites: Some familiarity with conditional probability, linear regression, and R.
  • In many biomedical and public health applications of statistics we are interested in determining the best treatment. However, people and their specific situations will vary and in some cases one treatment does not fit all! Instead, we can create a treatment rule which will take in a subject and their variables and predict the best treatment for them. We want to predict this as well as possible, and so we are looking for "optimal" rules. Optimal treatment rules involve both causal inference and statistical learning as we create rules based on estimated treatment effects. This project will first go over causal inference foundations and then explore Q-learning methods for treatment rules, which might include regression, penalized regression, and support vector machines depending on time and the student's background. We will use R to evaluate the methods with simulated and real data. If there is extra time, we could look into classification-based methods or dynamic treatment regimes.
  • Jess Kunke: Survey statistics and R

    Student: Mekias Kebede
    Slides , Writeup

  • Prerequisites: The project can be tailored based on the student's background knowledge; some prior exposure to concepts such as mean, variance, and probability would be helpful.
  • How do you analyze survey data? How do you design a survey to address a research question and account for uncertainty in the process? What goes into designing, conducting and analyzing big government surveys like the census? What kinds of surveys are there? These are some of the questions we can explore together. We can learn about some of the approaches to designing and analyzing surveys, and we can pick a data set to analyze. The exact direction can be tailored based on student interest and background.
  • Nick Irons: Bayesian Data Analysis

    Student: Qianqian (Emma) Yu
    Slides , Writeup

  • Prerequisites: Knowledge of probability at the level of STAT 311 and some familiarity with programming.
  • Bayesian statistics is a method of modeling data that synthesizes our prior beliefs about the data with the information contained in the sample to estimate model parameters. Rather than a single point estimate of a parameter, the output of a Bayesian model is a "posterior" distribution which captures the uncertainty in our inferences. Bayesian methods are at the heart of many modern data science and machine learning techniques. In this introduction to Bayesian statistics we will cover conditional distributions, Bayes' theorem, basics of Bayesian modeling, conjugate priors, MCMC sampling, and application to real dataset(s) of interest to the student in R. If time permits, possible further directions include hypothesis testing, linear regression, hierarchical models, Latent Dirichlet Allocation, and the EM algorithm for missing data. The goal of this project is to come away with an understanding of the basic conceptual and technical aspects of Bayesian inference and to get our hands dirty with real and interesting data. Possible data applications include estimating (potentially waning) COVID vaccine efficacy, estimating COVID prevalence over time in Washington state, topic modeling in NLP, or any other dataset of interest to the student.
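The idea of a conjugate prior can be sketched in a few lines of Python. The example below assumes a Beta prior on a success probability with binomial data; the counts are made up and the function names are hypothetical.

```python
def beta_binomial_update(a, b, successes, failures):
    """With a Beta(a, b) prior on a success probability and binomial data,
    the posterior is Beta(a + successes, b + failures)."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (1.0, 1.0)  # Beta(1, 1) is the uniform prior on [0, 1]
posterior = beta_binomial_update(*prior, successes=7, failures=3)

print(posterior)              # (8.0, 4.0)
print(beta_mean(*posterior))  # 8 / 12, about 0.667
```

The posterior mean (about 0.667) sits between the prior mean (0.5) and the observed proportion (0.7), which is the synthesis of prior belief and data described above.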
  • Erin Lipman: Bayesian perspectives on statistical modeling

    Student: Zhengyang (Anthony) Xu
    Slides , Writeup

  • Prerequisites: Some familiarity with multivariate linear regression will be helpful, as will some familiarity with R. Our project can be either more technical or more conceptual depending on the background and interests of the student.
  • Many of the methods we focus on in introductory statistics courses, for example confidence intervals and null hypothesis significance testing, come from the “Frequentist” philosophy of statistics. There is another, increasingly popular, philosophy of statistics called “Bayesian” statistics which has its own ways of conceptualizing and analyzing data. Bayesian statistics views parameters in the world (such as the effect of a medical treatment) as random variables rather than as fixed numbers, and it focuses on synthesizing prior evidence about the distribution of a parameter with information contained in the data. The goal of this project is to gain familiarity with statistical modeling from the Bayesian perspective.
  • Anna Neufeld: Introduction to Clinical Trials

    Student: Hisham Bhatti
    Slides , Writeup

  • Prerequisites: None.
  • Drawing mainly from the textbook "Fundamentals of Clinical Trials" by Friedman et al., we will learn about the design and analysis of clinical trials, with special attention to statistical considerations and the role of statisticians. Pending the interest of the student, for the final project we will either delve into an advanced statistical topic in clinical trials, or we will do a "case study" where we learn about a recent/current clinical trial in depth.
  • Sarah Teichman: Multivariate Data Analysis

    Student: Huong Ngo
    Slides , Writeup

  • Prerequisites: Stat 311, and linear algebra would be helpful but not necessary
  • Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook "An Introduction to Applied Multivariate Analysis with R" to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.
  • Seth Temple: Statistical Genetics I: Pedigrees and Relatedness

    Student: Saleh Wehelie
    Slides , Writeup

  • Prerequisites: STAT 311, and some programming experience
  • We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or entire genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for 1 or 2 meetings we will go through brief hands-on labs using current research software. More details on this recurring DRP may be found here: https://sdtemple.github.io/statgen1.
  • Drew Wise: Introduction to Nonparametric Statistics

    Student: Xinyi (Vicky) Xiang
    Slides , Writeup

  • Prerequisites: An introductory statistics class is all that's needed. Some programming experience would be a plus.
  • Many of the methods studied in an introductory statistics class — z-scores and t-tests, for example — rely on assumptions not always met by the data. The purpose of this project is to expose the student to nonparametric statistical tests, a class of techniques that are more broadly applicable. We will begin by discussing the advantages and disadvantages of nonparametric tests, and then we will study the tests themselves: Wilcoxon signed-rank tests, Mann-Whitney U-tests, and Kruskal-Wallis H-tests, among others. There is flexibility in the topics depending on student interest!
Autumn 2021

  • Nick Irons: Introduction to Bayesian Data Analysis

    Student: Xuweiyi (William) Chen
    Slides , Writeup

  • Prerequisites: Knowledge of expectations and probability distributions at the level of STAT 340-341 and some knowledge of R.
  • Bayesian statistics is a method of modeling data that synthesizes our prior beliefs about the data with the information contained in the sample to estimate model parameters. Rather than a single point estimate of a parameter, the output of a Bayesian model is a "posterior" distribution which captures the uncertainty in our inferences. Bayesian methods are at the heart of many modern data science and machine learning techniques. In this introduction to Bayesian statistics we will cover conditional distributions, Bayes' theorem, basics of Bayesian modeling, conjugate priors, MCMC sampling, and application to real dataset(s) of interest to the student in R. If time permits, possible further directions include hypothesis testing, linear regression, hierarchical models, and the EM algorithm for missing data. The goal of this project is to come away with an understanding of the basic conceptual and technical aspects of Bayesian inference and to get our hands dirty with real and interesting data. Possible data applications include estimating (potentially waning) COVID vaccine efficacy, estimating COVID prevalence over time in Washington state, or any other dataset of interest to the student.
  • Alex Ziyu Jiang: Clustering and music genre classification

    Student: Yitong (Eva) Shan
    Slides , Writeup

  • Prerequisites: Knowledge of probability at the level of Stat 311 or beyond; some coding experience, preferably in Python or R. It will be fantastic if you also happen to like listening to music ;)
  • Have you ever been amazed by the sheer number of music genres in your Spotify or Apple Music app and wanted to understand their differences in a quantitative way? In this project you will learn how to process audio data and use some interesting clustering techniques from machine learning to classify songs into different genres.
  • David Marcano and Daniel Suen: Cluster Analysis

    Students: Townson Cocke and Renee Chien
    Renee Slides , Renee Writeup

  • Prerequisites: Basic knowledge of R or Python, statistical background equivalent to STAT 311 is recommended
  • In many real-world data applications, from medicine to finance, it is of interest to find groups within the data. Clustering is an unsupervised learning approach for separating data into representative groups. How to find and assess the quality of these discovered clusters is a vast area of modern research. In this project, we will survey several popular clustering techniques and utilize them in simulated and real datasets. In particular, we will explore center-based approaches such as the k-means algorithm, dissimilarity-based approaches such as hierarchical clustering, probability-based approaches such as mixture models, and other techniques based on student interest. We will also look at how to assess a given clustering. The topics covered and their depth will develop based on the interest and statistical/mathematical level of the student. We are happy to take two students if more than one person is interested in this project.
  • Anna Neufeld: Multiple Testing

    Student: Cathy Qi
    Slides , Writeup

  • Prerequisites: Stat 311 and some knowledge of R will be helpful, but not required.
  • In an introductory statistics course, you learn how to obtain a p-value to test a single null hypothesis. These p-values are constructed such that, when the null hypothesis is true and you reject whenever the p-value falls below 0.05, you will make a mistake and reject the null only 5% of the time. In the real world, scientists often wish to test thousands of null hypotheses at once. In this setting, making a mistake on 5% of the hypotheses could lead to a very high number of false discoveries. Multiple testing techniques aim to limit the number of mistakes made over a large set of hypotheses without sacrificing too much power. We will start with a review of hypothesis testing, then discuss the challenges posed by large numbers of hypotheses, and finally learn about modern multiple testing techniques. Towards the end of the quarter, we will apply the techniques we learned to real data.
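To make the scale of the problem concrete: if 10,000 true null hypotheses are each tested at the 0.05 level, about 10,000 × 0.05 = 500 false rejections are expected. The Python sketch below shows two standard corrections, the Bonferroni correction and the Benjamini-Hochberg procedure, as examples of what "multiple testing techniques" can look like. The project description does not name specific methods, so this choice is an assumption, and the p-values are invented for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls family-wise error)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest rank k with
    p_(k) <= k * alpha / m and reject the k smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values, made up for this example.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print(sum(p <= 0.05 for p in pvals))   # 5 rejections with no correction
print(sum(bonferroni(pvals)))          # 1 rejection
print(sum(benjamini_hochberg(pvals)))  # 2 rejections
```

Bonferroni is the most conservative of the three; Benjamini-Hochberg controls the false discovery rate instead of the family-wise error rate and so retains more power.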
  • Michael Pearce: Voting, Ranking, and Preference Modeling

    Student: Carolina Sawyer
    Slides , Writeup

  • Prerequisites: Stat 311 or equivalent
  • Preference data appears in many forms: voters deciding between candidates in an election, movie critics rating new releases, and search engines ranking web pages, to name a few! However, modeling preferences in a statistical manner can be challenging for a variety of reasons, such as computational difficulties in working with discrete and high-dimensional data. In this project, we will study a variety of models used for preference data, which includes both ranking and scoring models. Understanding challenges and uncertainty in aggregating preferences will be a key focus. Together, we will also carry out an applied project on preference data based on the student's interests.
  • Seth Temple: Statistical Genetics I, Pedigrees and Relatedness

    Student: Michael Yung
    Slides , Writeup

  • Prerequisites: STAT 311; some programming experience preferred
  • We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or entire genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for 1 or 2 meetings we will go through brief hands-on labs using current research software. The final project may involve estimating familial relationships among individuals in the 1000 Genomes database and comparing outputs among various statistical software.
  • Vydhourie R.T. Thiyageswaran: Graph Clustering

    Student: Dawei Wang
    Slides , Writeup

  • Prerequisites: Introductory Linear Algebra (and interest in basic introductory graph theory would be helpful)
  • We will explore clustering methods in graphs. We will focus on k-means clustering and spectral clustering. Additionally, we will spend some time looking at applications by thinking about studies explored in statistical blog entries, for example, in FiveThirtyEight. If there’s interest, we can look into replicating and extending some of the ideas in these studies.
  • Steven Wilkins-Reeves: An Introduction to Causal Inference and Sensitivity Analysis

    Student: Hadi Nazirool Bin Yusri
    Slides , Writeup

  • Prerequisites: Stat 311 (would be useful to have familiarity with linear regression)
  • Randomized controlled trials are often called the “gold standard” for assessing the effect of a treatment on an outcome. However, for many scientific questions, a randomized controlled trial may be either unethical (e.g., you can’t force someone to smoke to figure out whether it causes cancer) or downright impossible (e.g., you can’t assign someone a higher birth weight). Techniques from causal inference can help us to estimate these treatment effects using only observational data and some identifying assumptions. Sensitivity analysis can tell us how robust our conclusions are to violations of those assumptions. In this project, you will read parts of Causal Inference in Statistics: A Primer by Pearl, Glymour, and Jewell, as well as some papers on the topic. A final project may involve analyzing an observational dataset of your choice by applying causal inference and sensitivity analysis techniques.
  • Kenny Zhang: Basics of Causal Inference

    Student: Qiguang Yan
    Slides , Writeup

  • Prerequisites: STAT 311 level statistics; some familiarity with regression is a plus.
  • "Correlation is not causation" used to prevent statisticians from answering questions like "Will smoking cause lung cancer?". However, with the tools of causal inference and the emergence of big data, we are able to answer some of these questions on a firm scientific basis. We can use causal inference to look at a variety of topics, including vaccination and genetics.
Spring 2021

  • Peter Gao: Ethics of Algorithmic Decision Making

    Student: Kevin Hoang
    Slides , Writeup

  • Prerequisites: None
  • In this project, we'll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We'll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we'll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there's interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.
  • Alex Ziyu Jiang: Sampling methods, Markov Chain Monte Carlo and Cryptography

    Student: Kathleen Cayha
    Slides , Writeup

  • Prerequisites: Basic knowledge of probability is recommended (STAT 311 level). Some prior coding experience with R would be great but not necessary
  • In this project we learn how to decipher coded messages with the widely used Markov chain Monte Carlo method. We will first go through the basics of Markov chain models after a quick probability warm-up. After that we will learn how to generate samples from a known distribution using a wide range of techniques. Finally, we will apply our tools to a dataset consisting of coded messages, and we will see how the 'messy' code gradually resolves into complete sentences using what we have learned.
  • Alan Min and Anupreet Porwal: Expectations and Sampling methods

    Students: Kai Gong and Aubrey Yan
    Kai Slides , Kai Writeup
    Aubrey Slides , Aubrey Writeup

  • Prerequisites: STAT 340-341 and some knowledge of R; basics of expectations and probability distributions.
  • The expectation of a random variable, or of a function of random variables, can be difficult to compute analytically when the probability distribution of those variables is not a standard, well-known distribution. One way to approximate expectations is by “intelligently” drawing samples from the probability distributions. In this project, we will cover several sampling and Monte Carlo methods to draw samples from “difficult distributions” and use these samples to approximate expectations. In particular, we will look at transformation-based sampling, importance sampling, rejection sampling, and their popular variants. Finally, we will compare the performance of these sampling methods and apply the methodology to a dataset of interest to the student in a Bayesian analysis. We are happy to take two students if more than one person is interested in this project.
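A minimal Python sketch of importance sampling, one of the methods named above: it estimates the small tail probability P(Z > 4) for a standard normal Z by sampling from a normal distribution centered at 4 and reweighting. The target, proposal, seed, and function name are all assumptions chosen for illustration.

```python
import math
import random

def tail_prob_importance(threshold, n, rng):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) using draws from the
    proposal N(threshold, 1), reweighted by target density / proposal density."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # exp(-x^2 / 2) / exp(-(x - t)^2 / 2) = exp(-t * x + t^2 / 2)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n

rng = random.Random(1)  # fixed seed, chosen arbitrarily
estimate = tail_prob_importance(4.0, 100_000, rng)

# Reference value from the complementary error function, for comparison.
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))

print(estimate, exact)
```

Plain Monte Carlo with the same number of draws would see only a handful of samples beyond 4, so its estimate would be far noisier; shifting the proposal toward the region of interest is what makes the estimate stable.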
  • Anna Neufeld: Infectious disease modeling

    Student: Kayla Kenyon
    Slides , Writeup

  • Prerequisites: Some comfort in R; experience with calculus and differential equations will be useful but not required.
  • We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
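The project works in R; for a quick illustration, here is a minimal Python sketch of a deterministic SIR model integrated with forward Euler steps. The parameter values, population size, step size, and function name are made up for this example and do not describe any real outbreak.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Forward-Euler integration of the SIR equations
        dS/dt = -beta * S * I / N
        dI/dt =  beta * S * I / N - gamma * I
        dR/dt =  gamma * I
    Returns a list of (time, S, I, R) tuples."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history

# Hypothetical inputs: transmission rate 0.3/day, recovery rate 0.1/day.
history = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])

print(round(peak_time, 1), round(peak_infected))
```

The output depends entirely on the input parameters `beta` and `gamma`, which is why estimating them from outbreak data, with uncertainty, is the statistical heart of the project.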
  • Michael Pearce: Nonlinear Regression

    Student: Muhammad Anas
    Slides , Writeup

  • Prerequisites: A basic knowledge of linear regression and some experience in R
  • Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
  • Taylor Okonek: Disease Mapping

    Student: Wuwei Zhang
    Slides , Writeup

  • Prerequisites: STAT 340; Interest in public health applications; Familiarity with R
  • Disease mapping is an important tool for visualizing spatial data on the prevalence and/or incidence of various diseases. In this project, we’ll discuss different types of spatial data, and explore visualization techniques and their usefulness in conveying relevant information. In particular, we’ll discuss ways to visualize uncertainty in disease mapping, how estimates underlying a disease map inform public health policy, issues with data sparsity and spatial aggregation, and how to obtain the estimates that underlie such maps. We’ll learn about Bayesian hierarchical models and, time permitting, spatial random effect terms. Throughout, we’ll explore concepts using real data from various diseases that are of student interest.
  • Sarah Teichman: Ethics of Algorithmic Decision Making

    Student: Liwen Peng
    Slides , Writeup

  • Prerequisites: None
  • In this project, we'll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We'll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we'll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there's interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.
  • Apara Venkat: Networks and Choice Modeling

    Student: Andrey Risukhin
    Slides , Writeup

  • Prerequisites: Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary. A general interest and curiosity about math and the world.
  • Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the "utility" of an item? How do you model the decisions of a customer when there are random effects? If we have data from decisions made by customers, can we identify the utility of various cereals? Discrete choice models attempt to answer these questions. In the language of networks, this problem is closely related to ranking. How does Google rank the webpages? How do we rank players in sports tournaments? Recent developments have been unifying these fields. In this project, we will wrestle with these questions. First, we will learn about discrete choice models. Then, we will learn about ranking in networks. Finally we will attempt to reconcile the two. Time permitting, we will also run simulations and look at datasets along the way.
Winter 2021

Peter Gao: Survey of Data Journalism

    Student: Rohini Mettu
    Slides , Writeup

  • Prerequisites: None
  • In this project, we'll take a look at recent uses of data and statistics in journalism and discuss their effectiveness in applying statistical methods and communicating results to their readers. If desired, we can focus on a specific area of application (climate change, economics, sports, epidemiology). If there is interest, we can look into replicating and extending a particular example.
  • Richard Guo: Making probability rigorous

    Student: Mark Lamin

  • Prerequisites: Probability theory at the level of MATH/STAT 394 and 395
  • Having sat through the introductory probability course, you have likely heard terms like "Lebesgue measure", "sigma algebra", "almost sure convergence" and even "martingale". Do you wonder what they are and why they matter? This is a reading program that will introduce these notions and make the probability you learned *rigorous*. We will read together the acclaimed monograph "Probability with Martingales" by David Williams. Rigorous treatment of probability and measure theory will prepare you for more advanced topics, such as stochastic processes, learning theory and theoretical statistics.
  • Bryan Martin: Statistical Learning with Sparsity

    Student: Jerry Su

  • Prerequisites: Familiarity with regression, up to a STAT 311 level
  • Many modern applications benefit from the principle of less is more. Whether due to practical computation concerns from big data, overfitting concerns from too many parameters, or estimability concerns from a small sample size, statistical models often require sparsity. Sparsity can improve our predictions, help make the patterns we observe in our data more reproducible, and give our model parameters desirable properties. Often, sparsity is imposed through penalization, where we include a term in our model to enforce that some parameters are set equal to zero. We will learn about some of the statistical theory underlying how penalization works, and how it impacts our model output, both mathematically and computationally. We will also learn about and compare different sparsity schemes, such as lasso, group lasso, elastic net, and more. We will focus on understanding the different settings in which we might be interested in different forms of sparsity and apply these tools to real data.
  • Eric Morenz and Yiqun Chen: See what's not there

    Student: Suh Young Choi
    Slides , Writeup

  • Prerequisites: Experience with linear regression, probability, or data manipulation will allow a deeper dive into the content, but it is not required for students who are interested in the subject.
  • In this project, we will take a look at the concept of identification in the context of missing data (and causal inference, if time permits or there is interest!!). While no glamorous artificial intelligence buzzwords are involved in the project per se, remember that your model is only as good as your data (and, as we will see by the end of the quarter, as good as your identification assumptions!). We will draw from various sources (e.g., Chapter 6 of "Foundations of Agnostic Statistics") so that the schedule and materials can adapt to your background and interests. We will consider a few empirical problems as well, from TidyTuesday data sets to political polls.
  • Taylor Okonek: Topics in Biostatistics

    Student: Anna Elias-Warren
    Slides , Writeup

  • Prerequisites: Introductory statistics a plus but not required, interest in public health applications
  • In this project, we’ll first broadly discuss some of the main pillars of the field of biostatistics, and then focus on a more specific topic for the remainder of the quarter. The main pillars we'll discuss include design of clinical trials, survival analysis, and infectious disease modeling. The focused part of this project can be greatly driven by student interest. Possible directions include: doing a deep-dive into the design of COVID-19 vaccine trials; reading articles about and discussing implications of communicating public health analyses to the public; gaining a broad understanding of how infectious disease models have influenced policy throughout the world; reading about ethical issues in global health studies; and more!
  • Michael Pearce: Nonlinear Regression

    Student: Alejandro Gonzalez
    Slides , Writeup

  • Prerequisites: A basic knowledge of linear regression and some experience in R
  • Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
  • Sarah Teichman: Multivariate Data Analysis

    Student: Lindsey Gao
    Slides , Writeup

  • Prerequisites: Stat 311, and linear algebra would be helpful but not necessary
  • Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook "An Introduction to Applied Multivariate Analysis with R" to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.
  • Seth Temple: Statistical Genetics and Identity by Descent

    Student: Selma Chihab
    Slides , Writeup

  • Prerequisites: STAT 311; some programming experience preferred
  • We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or entire genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for 1 or 2 meetings we will go through brief hands-on labs using current research software.
  • Apara Venkat: Networks and Choice Modeling

    Student: Xuling Yang
    Slides , Writeup

  • Prerequisites: Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary. A general interest and curiosity about math and the world.
  • Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the "utility" of an item? How do you model the decisions of a customer when there are random effects? If we have data from decisions made by customers, can we identify the utility of various cereals? Discrete choice models attempt to answer these questions. In the language of networks, this problem is closely related to ranking. How does Google rank webpages? How do we rank players in sports tournaments? Recent developments have been unifying these fields. In this project, we will wrestle with these questions. First, we will learn about discrete choice models. Then, we will learn about ranking in networks. Finally, we will attempt to reconcile the two. Time permitting, we will also run simulations and look at datasets along the way.
  • Jerry Wei: Topological Data Analysis

    Student: Joia Zhang
    Slides , Writeup

  • Prerequisites: Exposure to probability theory and linear algebra
  • Topological Data Analysis (TDA) is broadly about data analysis methods that find structure in data. This includes a lot of topics, and we will focus on clustering and nonlinear dimension reduction. We will study some textbook chapters and some classical papers. We may also go into mode estimation and manifold estimation if there is interest.
  • Kenny Zhang: Deep Learning for Computer Vision

    Student: Angela Zhao
    Slides , Writeup

  • Prerequisites: Proficiency in a programming language (preferably Python). Some exposure to basic probability rules and computer science would be helpful.
  • Image data is prevalent in our lives and modern deep learning provides a powerful tool to deal with image data. We will start with logistic regression and study what a neural network is. Then we will move on to convolutional neural networks and some coding exercises. If time allows, we can delve more into state-of-the-art Generative Adversarial Networks (GANs) and more complicated tasks like segmentation.
  • Autumn 2020

    Peter Gao: Statistics for Data Journalism: Election Forecasting

    Student: Andy Qin
    Slides

  • Prerequisites: Experience with introductory stats (at the level of any of the intro classes) would help.
  • In this project, we'll take a look at how leading newspapers and researchers conduct polls, forecast elections, and calculate polling averages. If there is interest, we can work on reverse engineering some of the methods used by publications such as FiveThirtyEight, RealClearPolitics, and the Upshot. Finally, we'll consider the ethics of forecasting elections and using statistics in general to study our election process.
  • Zhaoqi Li: Statistical Illusions

    Student: Yeji Sohn
    Slides

  • Prerequisites: Motivation to think about interesting problems and readiness for the brain to be teased. Some mathematical maturity would be beneficial.
  • Do you know that there is a “statistically significant” relationship between your salary and if you pee at night? Do you know that you will always wait longer than others at a bus stop? Do you know that a lot of the statistical concepts you learned in class actually don’t make sense? In this quarter-long study, we will dive into some common misconceptions about statistics and the questions of how to interpret statistics. We will touch on a wide range of statistical topics from a paradoxical view and learn the intuition behind them. No prior knowledge of statistics is required but motivation is encouraged.
  • Shane Lubold: Random Network Models

    Student: Peter Liu
    Slides

  • Prerequisites: Intro statistics and some programming experience (R or Python).
  • Network data, which consists of edges or relationships between nodes, is an important type of data. Many statistical models have been proposed to understand and model this type of data. Some are simple models which assume that all actors form connections with the same probability, while others are more complicated and use node-specific characteristics to determine the probability of an edge. In this project we will first review common network models (such as the Erdős-Rényi model, stochastic block model, and latent space model) and discuss why they might be useful in practice. We will then fit these models to data sets from the Stanford Network Analysis Project and discuss why some models fit better than others. The goal of the project is to understand how network data can arise in the real world and how network properties determine which models are reasonable. If we have time, we can also look at dynamic networks (networks that change over time) and see if we can model them using any of the models discussed above.
  • Bryan Martin: Ethics in Data Science and Statistics

    Student: Jinghua Sun
    Slides

  • Prerequisites: None
  • In this project, we will discuss ethical questions and issues that arise in the field of statistics and data science. We will read case studies and work together to develop a lesson that can be taught to introductory statistics students as part of an undergraduate curriculum. By the end of this project, I hope to have material that I will use in my own courses! Topics will be driven by the student's particular interest, but possible topics will include: the history of statistics and eugenics, race and gender in data science, algorithmic fairness, reproducibility and open science, data transparency, privacy, and human subjects data.
  • Ronak Mehta: The Magical Properties of the SVD

    Student: Claire Gao

  • Prerequisites: Linear Algebra (Math 308 or equivalent). Some statistical background, preferably at the level of 340.
  • The singular value decomposition (SVD) of a matrix has wide relevance in virtually all areas of applied mathematics. This project will consist of three parts: 1) a theory section containing proofs of intriguing properties of the SVD and derivations of three problems of wide importance in statistics and machine learning: principal components analysis (PCA), partial least squares (PLS), and canonical correlation analysis (CCA), all of whose solutions depend heavily on the SVD; 2) a simulation section demonstrating the bias-variance tradeoff of using one method over another for various regression/classification tasks; and 3) a real data section in which the student interprets the features learned by these methods on a dataset of their choice. This project is ideal for intermediate statistics students who want to make their linear algebra skills airtight and have a strong mathematical foundation for future success in machine learning.
  • Anna Neufeld: Infectious Disease Modeling

    Student: Harper Zhu
    Slides , Shiny App

  • Prerequisites: Some comfort in R; experience with calculus and differential equations will be useful but not required.
  • We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
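As a rough illustration of the kind of deterministic model described above, here is a minimal sketch of a standard SIR system (dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I) stepped forward with a simple Euler scheme. It is written in Python rather than R, and the function name `simulate_sir`, the step size, and every parameter value are assumptions chosen for illustration only; they are not taken from the project materials.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of the standard SIR equations.

    beta  : transmission rate (hypothetical value supplied by the caller)
    gamma : recovery rate (hypothetical value supplied by the caller)
    Returns a list of (time, S, I, R) tuples.
    """
    n = s0 + i0 + r0                      # total population, held fixed
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt   # flow from S to I
        new_recoveries = gamma * i * dt          # flow from I to R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((k * dt, s, i, r))
    return trajectory


# Illustrative inputs only: 1000 people, 10 initially infected.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_time, _, peak_infected, _ = max(trajectory, key=lambda row: row[2])
print(f"Infections peak around day {peak_time:.1f} "
      f"with about {peak_infected:.0f} people infected")
```

Changing `beta` or `gamma` changes the whole trajectory, which is why estimating these inputs from outbreak data matters so much for forecasting.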
  • Michael Pearce: History and Practice of Data Communication

    Student: Ziyi Li
    Writeup

  • Prerequisites: None; some experience with R or Python may be helpful but is not required.
  • In this course, we'll learn about the development of data communication techniques and their modern use. We'll begin by studying how people have visualized patterns in data over time, and consider how those methods reflected the computational resources available in each era. Then, we'll shift our attention to modern issues in data communication, drawing examples from the COVID-19 pandemic and 2020 US presidential election: How do practitioners effectively show complex relationships or model uncertainty? How do people mislead readers through text and figures (intentionally or otherwise)? What common pitfalls exist, and how can we avoid them? We'll finish with a data communication project based on the student's interests.
  • Subodh Selukar: Introduction to Survival Analysis

    Student: Howard Baek
    Writeup , Shiny App

  • Prerequisites: Familiarity with R; familiarity with survival analysis
  • In many applications, researchers are interested in the time it takes for an outcome of interest to occur: for example, time to death by any cause ("overall survival") is the gold standard outcome for studies in many biomedical fields. Among other characteristics, these data exhibit a special kind of missingness termed "censoring," which requires the use of different statistical methods than other data types. In this project, the student will learn about the characteristics of time-to-event (or "survival") data and basic methods for approaching these data.
  • Sarah Teichman: Phylogenetic Trees

    Student: Lexi Xia
    Writeup

  • Prerequisites: An intro stats class. Some R experience is useful but not required.
  • In this project, we will learn about the application of statistics to evolutionary biology through working with phylogenetic trees. In evolutionary biology, a diagram in the form of a tree is often used to represent the diversification of species over time. In this project, we'll read chapters from the book "Tree Thinking" and choose a dataset to investigate deeply in R in order to understand phylogenies: what they are, how they are used, and how we can use statistics to construct them and test hypotheses about evolution.
  • Seth Temple: Statistical Genetics and Identity by Descent

    Student: Rachel Ferina
    Slides , Writeup

  • Prerequisites: None; keen interest in the biological sciences
  • We will explore many classical ways in which statistics has been employed to study heredity in humans and other organisms. For example, we will introduce the expectation-maximization algorithm to infer allele frequencies for ABO blood types and discuss Jacquard’s 9 condensed states of identity by descent. This tutorial will be very practical as we will draw many family trees to compute kinship and inbreeding coefficients. We will use UW emeritus professor Elizabeth Thompson’s monograph "Statistical Inference from Genetic Data on Pedigrees" as reading material. Depending on student interest, we may read more chapters from Thompson’s book, investigate the history of statistical genetics as it relates to the eugenics movement, or code up some computations like the path counting formula.
Spring 2020

We ran a limited number of projects due to COVID-19.

Sheridan Grant: Causal Inference: DAGs and Potential Outcomes

    Student: Grace Shen
    Slides

  • Prerequisites: Familiarity with linear regression and facility with Gaussian distributions (preferably multivariate)
  • This project will be reading-focused, rather than data analysis. It's intended for a junior or senior student who is interested in learning about Causal Inference--a huge topic in graduate-level statistics and stats research--perhaps as a prelude to applying for PhDs. You'll read classic papers and parts of textbooks on two approaches to causal inference, potential outcomes & graphs. For the final presentation, you'll contrast the two approaches as applied to a problem (practical or theoretical) of your choice.
  • Shane Lubold: Random Graphs

    Student: Gordon An

  • Prerequisites: Some exposure to probability. Some exposure to, or an interest in, graph theory.
  • In this project, we will study random graph theory and how the behavior of these graphs changes as the size of the graph grows. We will focus primarily on a simple graph model with a number of interesting properties, the Erdős-Rényi model. In this simple model, we generate a graph on n nodes, where each node connects to any other node with probability p(n), which can depend on the graph size n. We will use theory and simulations to derive key properties of this model, such as the distribution of the degree of a vertex or the number of cliques of any size. We will also explore other exciting properties of this model. For example, if the product p*n grows at a certain rate as n gets big, then the graph will exhibit large cliques with probability 1. The proofs of these results use only basic statistical ideas. We will prove the conditions that guarantee this behavior and use simulations to explore how large the graphs must be to see this behavior. This project will expose students to the exciting field of random graphs and will give them a good understanding of how simple statistical tools can answer complex questions.
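The model described above can be simulated in a few lines. The following is a minimal sketch in Python using only the standard library; the helper names (`sample_gnp`, `degrees`, `count_triangles`) and the values of n and p are hypothetical choices for illustration, not part of the project materials. It samples one graph, then compares the average degree and the number of 3-cliques with their expectations, (n-1)p and C(n,3)p^3.

```python
import random
from itertools import combinations
from math import comb


def sample_gnp(n, p, rng):
    """Sample an undirected graph on n nodes where each of the
    C(n, 2) possible edges is included independently with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degrees(adj):
    """Degree of every vertex, in vertex order."""
    return [len(adj[v]) for v in sorted(adj)]


def count_triangles(adj):
    """Count cliques of size 3 by checking every vertex triple."""
    return sum(
        1
        for u, v, w in combinations(sorted(adj), 3)
        if v in adj[u] and w in adj[u] and w in adj[v]
    )


rng = random.Random(0)          # fixed seed so the run is repeatable
n, p = 200, 0.05                # illustrative values only
graph = sample_gnp(n, p, rng)

mean_degree = sum(degrees(graph)) / n
print("mean degree:", mean_degree, "expected:", (n - 1) * p)
print("triangles:", count_triangles(graph), "expected:", comb(n, 3) * p ** 3)
```

Each vertex degree follows a Binomial(n-1, p) distribution, so repeating the simulation for growing n with p depending on n is one way to explore the limiting behavior empirically.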
  • Anna Neufeld: Disease Modeling

    Student: Rachael Ren
    Writeup , Slides

  • Prerequisites: Knowledge of R will be useful!
  • We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest of the student and relevant current events.
  • Winter 2020

    Peter Gao: Introduction to Gaussian Processes

    Student: Hannah Chang

  • Prerequisites: None; interest in programming encouraged
  • As a concept, the Gaussian distribution, often referred to as the normal distribution or the bell curve, has cemented itself in the public consciousness. But what about its finite dimensional generalization, the multivariate Gaussian? Or its infinite dimensional counterpart, the Gaussian process? This project has two main aims: first, to discuss and explore how Gaussian processes arise in various subfields like machine learning and spatial statistics; and second, to develop notes (or a website) that explain Gaussian processes to a general audience. Of course, the exact focus of this project is flexible, based on the reader's interests/background.
  • Kristof Glauninger: Nonparametric Regression

    Student: Eli Grosman
    Writeup

  • Prerequisites: Familiarity with linear regression and basic probability, comfort with algebra, some calculus
  • Nonparametric statistical methods have seen an explosion in popularity as datasets have increased in size and complexity. The goal of this project will be to introduce students who are familiar with parametric regression models to a nonparametric setting. We will explore some of the basic theory and applications of these models, as well as an interesting case where we can achieve parametric convergence rates in a nonparametric setting.
  • Zhaoqi Li: Statistical Machine Learning and Data Analysis

    Student: Zhijun Peng
    Writeup

  • Prerequisites: Knowledge of probability theory and Maximum Likelihood Estimation at the level of Stat 340 is preferred; some familiarity with basic programming is preferred; an enthusiasm for reading and experimenting is encouraged.
  • We will discuss the relationship between statistics and machine learning, one of the most popular fields in the world, and how statistical techniques could be used in the machine learning framework. Topics may include classifiers (e.g., Decision Tree, Naive Bayes), training (what is training and the relation to likelihood inference), etc. The design could range from experimental to theoretical, depending on the background of the student.
  • Shane Lubold: Random Graphs

    Student: Tahmin Talukder

  • Prerequisites: Some exposure to probability. Some exposure to, or an interest in, graph theory.
  • In this project, we will study random graph theory and how the behavior of these graphs changes as the size of the graph grows. We will focus primarily on a simple graph model with a number of interesting properties, the Erdős-Rényi model. In this simple model, we generate a graph on n nodes, where each node connects to any other node with probability p(n), which can depend on the graph size n. We will use theory and simulations to derive key properties of this model, such as the distribution of the degree of a vertex or the number of cliques of any size. We will also explore other exciting properties of this model. For example, if the product p*n grows at a certain rate as n gets big, then the graph will exhibit large cliques with probability 1. The proofs of these results use only basic statistical ideas. We will prove the conditions that guarantee this behavior and use simulations to explore how large the graphs must be to see this behavior. This project will expose students to the exciting field of random graphs and will give them a good understanding of how simple statistical tools can answer complex questions.
  • Bryan Martin: R Package Development

    Student: Thomas Serrano
    Writeup

  • Prerequisites: Familiarity with R
  • Reproducible statistical analysis depends on good software and coding practices. In this project, we will learn how to go from users of R packages to developers of R packages. We will also practice and implement general software developer skills, including documentation, version control, and unit testing. We will learn how to make our code robust, efficient, and user-friendly. Ideally, you will start with an idea of something you are interested in implementing as an R package, whether it be a statistical model, data analysis application, or anything else, though this is not required!
  • Anna Neufeld: Statistical Natural Language Processing

    Student: Christina Nick

  • Prerequisites: Proficiency in a programming language. Knowledge of basic probability rules at the level of Stat 311.
  • In most statistics classes, the data you work with are numbers. Text documents such as books, articles, and speeches provide massive sources of data that can not be analyzed using the tools from your introductory statistics courses. We will explore the field of statistical natural language processing and discuss classification and clustering techniques for text data. Applications of such techniques include translation, information retrieval, fake news detection, and sentiment analysis. After reviewing the literature to get a sense of the general techniques in NLP, we will select a particular text dataset and research question and work on an applied project.
  • Michael Pearce: Nonlinear Regression

    Student: Oliver Bejar Tjalve
    Writeup

  • Prerequisites: A basic knowledge of linear regression and some experience in R
  • Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
  • Anupreet Porwal: Bayesian Linear regression and applications

    Student: Yuchen Sun
    Writeup

  • Prerequisites: Basic knowledge of probability distributions at the level of Stat 394 or Stat 340. Knowledge of Linear Algebra is essential for this project. Familiarity with a programming language may be helpful.
  • Often when we fit models to practical applications, we have some prior understanding of the context of the problem/field which could potentially be useful to tune our model along with the data. For example, if you are trying to model the reply times of emails from the dept. chair to professors, the designation of each professor (full-time/assistant) can be helpful information. Bayesian statistics provides a formal way to incorporate our prior beliefs and information into the model and is particularly useful because it helps to quantify the uncertainty in our inferences. In this project, we will discuss the basics of Bayes' theorem and the Bayesian version of linear regression and, if time permits, we will learn about probabilistic matrix factorization (recommendation systems) and apply these techniques to an interesting problem.
  • Sarah Teichman: Networks

    Student: Josiah Thulin
    Writeup

  • Prerequisites: Stat 311. Some R is useful but not required.
  • Most of the data that you see in STAT 311 are assumed to be independent. However, a lot of interesting datasets include information about individual observations and the relationships between them. This type of data can be analyzed as networks, in which nodes represent individuals and edges represent relationships between them. Networks can be used to study interactions between social groups, the spread of contagious diseases, biological cycles, etc. We will use the textbook "Statistical Analysis of Network Data," along with its companion text "Statistical Analysis of Network Data in R" by Eric D. Kolaczyk. We will additionally read one or two papers about an application of network analysis and/or analyze a small network in R (based on interest of the student).