Winter 2022

This is a list of potential topics for Winter 2022 that will appear on the application; the actual set of topics offered is subject to change.

If a mentor is listed for multiple project topics, they will choose which project to offer based on the interest of applicants. If you are really interested in a statistical topic that is not listed for this quarter, feel free to let us know on the application; a few mentors are willing to switch their topic if a student has strong interests.

Medha Agarwal: Statistical Simulations

  • Prerequisites: STAT 311, programming experience (preferably in R/Python)
  • This project aims to explore various methods of statistical simulation: their theoretical underpinnings and their practical use. We will cover methods of obtaining independent and identically distributed random samples for both continuous and discrete random variables, including the inverse transform, accept-reject, ratio-of-uniforms, and importance sampling methods. During the later parts of the project, we will delve into Markov chain Monte Carlo (MCMC), a robust method of obtaining correlated random samples from any probability distribution. While MCMC is a rich area in itself (reading about it is highly encouraged), we will cover the two most popular MCMC algorithms: Metropolis-Hastings and Gibbs sampling. Since simulation is a very programming-centric topic, the project will regularly involve coding the sampling methods covered, as in the sketch below. These will be short programs for toy examples and will not require advanced programming skills.
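
As a small taste of that coding, here is a minimal R sketch of the inverse transform method for the exponential distribution; the rate parameter and sample size are illustrative choices, not part of the project materials.

```r
# Inverse transform sampling for an Exponential(rate) distribution:
# if U ~ Uniform(0, 1), then -log(1 - U) / rate has the target distribution,
# because that expression is the inverse of the exponential CDF.
set.seed(1)
rate <- 2
u <- runif(10000)
x <- -log(1 - u) / rate

# Sanity check against R's built-in sampler: both means should be near 1 / rate.
mean(x)
mean(rexp(10000, rate = rate))
```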

Michael Cunetta: Sabermetrics

  • Prerequisites: Familiarity with the rules of Major League Baseball. Some familiarity with R.
  • We will read excerpts from "The Book: Playing the Percentages in Baseball" (2007) and carry out our own inference (in R) using baseball datasets. By the end of the project, we will understand core sabermetric principles, we will be critical consumers of baseball analysis, and we will be able to ask and answer our own baseball-related research questions. In April, the student and mentor will go on a field trip to T-Mobile Park to cheer on the Mariners.
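
As a flavor of the kind of computation involved, here is a minimal R sketch that computes on-base percentage (OBP), one of the basic rate statistics behind sabermetric analysis; the players and counts are invented for illustration.

```r
# Hypothetical season batting lines (all counts invented for illustration).
batting <- data.frame(
  player = c("A", "B"),
  H   = c(150, 120),  # hits
  BB  = c(60, 90),    # walks
  HBP = c(5, 3),      # hit by pitch
  AB  = c(500, 450),  # at bats
  SF  = c(4, 6)       # sacrifice flies
)

# On-base percentage: times on base divided by plate appearances.
batting$OBP <- with(batting, (H + BB + HBP) / (AB + BB + HBP + SF))
batting
```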

Nina Galanter: Optimal Treatment Rules: Causal Inference and Statistical Learning

  • Prerequisites: Some familiarity with conditional probability, linear regression, and R.
  • In many biomedical and public health applications of statistics we are interested in determining the best treatment. However, people and their specific situations will vary and in some cases one treatment does not fit all! Instead, we can create a treatment rule which will take in a subject and their variables and predict the best treatment for them. We want to predict this as well as possible, and so we are looking for "optimal" rules. Optimal treatment rules involve both causal inference and statistical learning as we create rules based on estimated treatment effects. This project will first go over causal inference foundations and then explore Q-learning methods for treatment rules, which might include regression, penalized regression, and support vector machines depending on time and the student's background. We will use R to evaluate the methods with simulated and real data. If there is extra time, we could look into classification-based methods or dynamic treatment regimes.
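
To make the idea concrete, here is a minimal R sketch of the simplest Q-learning approach, a regression with a treatment-by-covariate interaction, on simulated data; the data-generating model is invented for illustration.

```r
set.seed(1)
n <- 500
x <- rnorm(n)            # a patient covariate
a <- rbinom(n, 1, 0.5)   # randomized binary treatment
# True outcome model: treatment helps when x > 0 and hurts when x < 0.
y <- 1 + 0.5 * x + a * (2 * x) + rnorm(n)

# Q-learning with a linear model: regress the outcome on treatment,
# covariate, and their interaction, then assign each subject the
# treatment with the larger predicted outcome.
fit <- lm(y ~ a * x)
pred0 <- predict(fit, newdata = data.frame(a = 0, x = x))
pred1 <- predict(fit, newdata = data.frame(a = 1, x = x))
rule <- as.numeric(pred1 > pred0)

# The estimated rule should approximately recover "treat when x > 0".
table(rule, truth = x > 0)
```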

Jess Kunke: Survey statistics

  • Prerequisites: The project can be tailored based on the student's background knowledge; some prior exposure to concepts such as mean, variance, and probability would be helpful.
  • How do you analyze survey data? How do you design a survey to address a research question and account for uncertainty in the process? What goes into designing, conducting and analyzing big government surveys like the census? What kinds of surveys are there? These are some of the questions we can explore together. We can learn about some of the approaches to designing and analyzing surveys, and we can pick a data set to analyze. The exact direction can be tailored based on student interest and background.
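
As one small example of the kind of calculation involved, here is a minimal base-R sketch of the classic stratified-sampling estimate of a population mean; the strata, sizes, and responses are invented for illustration.

```r
set.seed(1)
# Hypothetical population stratum sizes (e.g., regions of a state).
N_h <- c(5000, 3000, 2000)
N <- sum(N_h)

# Draw a simple random sample within each stratum (responses simulated here).
n_h <- c(50, 30, 20)
samples <- list(rnorm(n_h[1], mean = 10),
                rnorm(n_h[2], mean = 12),
                rnorm(n_h[3], mean = 15))

# Stratified estimator: weight each stratum's sample mean by its
# population share N_h / N.
ybar_h <- sapply(samples, mean)
ybar_strat <- sum(N_h / N * ybar_h)
ybar_strat   # estimates the population mean, here (5*10 + 3*12 + 2*15)/10
```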

Jess Kunke: Simulating data

  • Prerequisites: The project can be tailored based on the student's background knowledge; some prior exposure to concepts such as mean, variance, and probability would be helpful.
  • In statistics classes we talk about data being drawn from various distributions such as a Bernoulli or a Poisson distribution. If we want to simulate Bernoulli data, we can flip a coin many times and write down the results, but how do we generate data from other distributions? And when we fit a statistical model to data, one way we could examine how well the model fits the data is to simulate data from the model and see whether our data are typical of the kind of data the model generates. But how do you do that? We’ll learn how to use R to simulate data from a particular distribution or model, and we’ll apply it to some examples. The exact direction can be tailored based on student interest and background; final projects could involve conducting a data analysis or designing a tutorial to teach others how and why to simulate data.
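
Here is a minimal R sketch of the second idea above: fit a model, simulate replicate data from it, and check whether the observed data look typical; the example data are themselves simulated for illustration.

```r
set.seed(1)
# "Observed" data: counts that are actually overdispersed relative to Poisson.
y <- rnbinom(200, size = 2, mu = 5)

# Fit a Poisson model (a single rate here) and simulate replicate
# datasets from the fitted model.
lambda_hat <- mean(y)
sim_vars <- replicate(1000, var(rpois(length(y), lambda_hat)))

# Compare the observed variance to the variances of the simulated datasets;
# if the observed value is far in the tail, the Poisson model fits poorly.
mean(sim_vars >= var(y))
```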

Nick Irons: Bayesian Data Analysis

  • Prerequisites: Knowledge of probability at the level of STAT 311 and some familiarity with programming.
  • Bayesian statistics is a method of modeling data that synthesizes our prior beliefs about the data with the information contained in the sample to estimate model parameters. Rather than a single point estimate of a parameter, the output of a Bayesian model is a "posterior" distribution which captures the uncertainty in our inferences. Bayesian methods are at the heart of many modern data science and machine learning techniques. In this introduction to Bayesian statistics we will cover conditional distributions, Bayes' theorem, basics of Bayesian modeling, conjugate priors, MCMC sampling, and application to real dataset(s) of interest to the student in R. If time permits, possible further directions include hypothesis testing, linear regression, hierarchical models, Latent Dirichlet Allocation, and the EM algorithm for missing data. The goal of this project is to come away with an understanding of the basic conceptual and technical aspects of Bayesian inference and to get our hands dirty with real and interesting data. Possible data applications include estimating (potentially waning) COVID vaccine efficacy, estimating COVID prevalence over time in Washington state, topic modeling in NLP, or any other dataset of interest to the student.
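
As a first taste of the modeling, here is a minimal R sketch of the Beta-Binomial model, the standard introductory example of a conjugate prior; the prior parameters and data are invented for illustration.

```r
# Beta(a, b) prior on a success probability p, then observe y successes
# in n trials; conjugacy gives a Beta(a + y, b + n - y) posterior.
a <- 2; b <- 2     # prior beliefs about p
y <- 14; n <- 20   # observed data

post_a <- a + y
post_b <- b + n - y

# Posterior mean and a 95% credible interval for p.
post_a / (post_a + post_b)
qbeta(c(0.025, 0.975), post_a, post_b)
```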

Erin Lipman: Bayesian perspectives on statistical modeling

  • Prerequisites: Some familiarity with multivariate linear regression will be helpful, as will some familiarity with R. Our project can be either more technical or more conceptual depending on the background and interests of the student.
  • Many of the methods we focus on in introductory statistics courses, for example confidence intervals and null hypothesis significance testing, come from the “Frequentist” philosophy of statistics. There is another, increasingly popular, philosophy of statistics called “Bayesian” statistics which has its own ways of conceptualizing and analyzing data. Bayesian statistics views parameters in the world (such as the effect of a medical treatment) as random variables rather than as fixed numbers, and it focuses on synthesizing prior evidence about the distribution of a parameter with information contained in the data. The goal of this project is to gain familiarity with statistical modeling from the Bayesian perspective.
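
To make "synthesizing prior evidence with the data" concrete, here is a minimal R sketch of a conjugate normal model for a treatment effect, assuming a known sampling standard deviation; all of the numbers are invented for illustration.

```r
# Prior: effect ~ Normal(mu0, tau0^2); data: sample mean ybar from n
# subjects with known sampling sd sigma. The posterior is again normal,
# with mean a precision-weighted average of prior mean and sample mean.
mu0 <- 0; tau0 <- 1             # skeptical prior centered at no effect
ybar <- 0.8; n <- 25; sigma <- 2

prec_prior <- 1 / tau0^2
prec_data  <- n / sigma^2
post_mean <- (prec_prior * mu0 + prec_data * ybar) / (prec_prior + prec_data)
post_sd   <- sqrt(1 / (prec_prior + prec_data))
c(post_mean, post_sd)   # posterior pulled from ybar = 0.8 toward the prior mean 0
```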

Anna Neufeld: Multiple Testing

  • Prerequisites: STAT 311 and some knowledge of R will be helpful, but not required.
  • In an introductory statistics course, you learn how to obtain a p-value to test a single null hypothesis. These p-values are constructed such that, when the null hypothesis is true, you will make a mistake and reject the null only 5% of the time. In the real world, scientists often wish to test thousands of null hypotheses at once. In this setting, making a mistake on 5% of the hypotheses could lead to a very high number of false discoveries. Multiple testing techniques aim to limit the number of mistakes made over a large set of hypotheses without sacrificing too much power. We will start with a review of hypothesis testing, then discuss the challenges posed by large numbers of hypotheses, and finally learn about modern multiple testing techniques. Towards the end of the quarter, we will apply the techniques we learned to real data.
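
Here is a minimal R sketch of the problem and of two standard corrections, using base R's p.adjust; the simulation settings are invented for illustration.

```r
set.seed(1)
# 1000 tests: 900 true nulls (uniform p-values) and 100 true effects
# (p-values from one-sided z-tests with a real signal).
p_null <- runif(900)
p_alt  <- pnorm(-rnorm(100, mean = 3.5))
p <- c(p_null, p_alt)

# Testing each hypothesis at the 5% level rejects dozens of true nulls...
sum(p_null < 0.05)

# ...while Bonferroni (family-wise error rate) and Benjamini-Hochberg
# (false discovery rate) corrections limit mistakes, at different costs
# in power.
sum(p.adjust(p, method = "bonferroni") < 0.05)
sum(p.adjust(p, method = "BH") < 0.05)
```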

Anna Neufeld: Introduction to Clinical Trials

  • Prerequisites: None.
  • Drawing mainly from the textbook "Fundamentals of Clinical Trials" by Friedman et al., we will learn about the design and analysis of clinical trials, with special attention to statistical considerations and the role of statisticians. Depending on the interest of the student, the final project will either delve into an advanced statistical topic in clinical trials or be a "case study" in which we learn about a recent or current clinical trial in depth.
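
As one example of a statistical consideration in trial design, here is a minimal R sketch of a sample-size calculation using base R's power.t.test; the effect size and variability are invented for illustration.

```r
# How many patients per arm does a two-arm trial need to detect a mean
# difference of 5 units (sd 12) with 90% power at the 5% significance level?
power.t.test(delta = 5, sd = 12, sig.level = 0.05, power = 0.90)
```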

Sarah Teichman: Multivariate Data Analysis

  • Prerequisites: STAT 311; linear algebra would be helpful but not necessary.
  • Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook "An Introduction to Applied Multivariate Analysis with R" to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.
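
As a preview of one of those methods, here is a minimal R sketch of principal components analysis on R's built-in USArrests dataset (four crime-related variables measured on 50 states).

```r
# Principal components analysis on R's built-in USArrests data.
# Scaling matters because the variables are on different units.
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # each state's coordinates on the first two components
biplot(pca)         # states and variables in the same 2-D picture
```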

Seth Temple: Statistical Genetics I: Pedigrees and Relatedness

  • Prerequisites: STAT 311, and some programming experience
  • We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or whole-genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group (sketched below), Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for one or two meetings we will go through brief hands-on labs using current research software. More details on this recurring DRP may be found here: https://sdtemple.github.io/statgen1.
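
As a preview of one of those topics, here is a minimal R sketch of the gene-counting EM algorithm for ABO allele frequencies; the phenotype counts are invented for illustration.

```r
# EM (gene counting) for ABO allele frequencies (p, q, r) from the four
# observed phenotype counts. Type A hides genotypes AA and AO, and type B
# hides BB and BO, so we alternate between splitting those counts given
# current frequencies (E-step) and re-counting alleles (M-step).
abo_em <- function(nA, nB, nAB, nO, iters = 50) {
  n <- nA + nB + nAB + nO
  p <- q <- 1 / 3; r <- 1 / 3
  for (i in seq_len(iters)) {
    # E-step: expected genotype counts given current allele frequencies.
    nAA <- nA * p^2 / (p^2 + 2 * p * r)
    nAO <- nA * 2 * p * r / (p^2 + 2 * p * r)
    nBB <- nB * q^2 / (q^2 + 2 * q * r)
    nBO <- nB * 2 * q * r / (q^2 + 2 * q * r)
    # M-step: count alleles among the 2n gene copies.
    p <- (2 * nAA + nAO + nAB) / (2 * n)
    q <- (2 * nBB + nBO + nAB) / (2 * n)
    r <- 1 - p - q
  }
  c(A = p, B = q, O = r)
}

abo_em(nA = 186, nB = 38, nAB = 13, nO = 284)  # counts invented for illustration
```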

Seth Temple: Statistical Genetics II: Genome-wide Association Studies

  • Prerequisites: STAT 340/1/2, or a course in linear modeling (regression analysis)
  • We will build working knowledge of statistical genetics, linear modeling, and hypothesis testing to better understand the landmark papers Yu et al. (2006) and Klein et al. (2005). These papers introduced genome-wide association testing, a methodological paradigm that has led to the discovery of thousands of associations between genetic markers and traits. In surveying these advances, we will learn about the theory of linear models, mixed models, principal components, multiple testing, and permutation testing. One meeting will concern the difference between causation and association, especially considering statistical genetics’ long, coupled history with eugenics. The final project will involve conducting a genome-wide association study using the publicly available 1000 Genomes database. References may include Professor Timothy Thornton’s lecture slides and “Overview of Statistical Methods for Genome-Wide Association Studies (GWAS)” from Genome-Wide Association Studies and Genomic Prediction (2013).
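
To preview the core computation, here is a minimal R sketch of single-SNP association tests on simulated genotypes; a real GWAS layers covariates, mixed models, and principal components on top of this, and all settings here are invented for illustration.

```r
set.seed(1)
n <- 300; m <- 200                             # subjects and SNPs
maf <- runif(m, 0.1, 0.5)                      # minor allele frequencies
G <- sapply(maf, function(f) rbinom(n, 2, f))  # genotypes coded 0/1/2

# Trait influenced by SNP 1 only; all other SNPs are null.
y <- 0.5 * G[, 1] + rnorm(n)

# One linear regression per SNP; record the p-value for the genotype slope.
pvals <- apply(G, 2, function(g) summary(lm(y ~ g))$coefficients[2, 4])
which(p.adjust(pvals, method = "bonferroni") < 0.05)   # should flag SNP 1
```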

Seth Temple: Statistical Genetics III: Markov Models

  • Prerequisites: STAT 394/395, or a math course in stochastic processes
  • The Markov assumption is a common simplification in statistical genetics and other fields. Broadly speaking, the assumption is that the future depends only on the most recent past. We will study this formally in the context of probability models, with readings from Richard Durrett’s "Essentials of Stochastic Processes" textbook. Motivating examples from statistical genetics will be drawn from Richard Durbin’s "Biological Sequence Analysis" textbook. These include hidden Markov models, which underlie sequence alignment for whole genomes, the Beagle phasing software, and inference of historical effective population sizes. If time permits, we may read about Poisson processes and their application to coalescent trees.
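
Here is a minimal R sketch of the basic object of study, a discrete-time Markov chain simulated from its transition matrix; the two-state chain is a standard toy example, not drawn from the readings.

```r
set.seed(1)
# Transition matrix for a two-state chain (rows sum to 1): entry (i, j)
# is the probability of moving from state i to state j at the next step.
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)
states <- c("dry", "wet")

n_steps <- 10000
x <- integer(n_steps)
x[1] <- 1
for (t in 2:n_steps) {
  # Markov property: the next state depends only on the current state.
  x[t] <- sample(1:2, size = 1, prob = P[x[t - 1], ])
}

table(states[x]) / n_steps   # long-run frequencies approach (0.75, 0.25)
```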

Drew Wise: Introduction to Nonparametric Statistics

  • Prerequisites: An introductory statistics class is all that's needed. Some programming experience would be a plus.
  • Many of the methods studied in an introductory statistics class — z-scores and t-tests, for example — rely on assumptions not always met by the data. The purpose of this project is to expose the student to nonparametric statistical tests, a class of techniques that are more broadly applicable. We will begin by discussing the advantages and disadvantages of nonparametric tests, and then we will study the tests themselves: Wilcoxon signed-rank tests, Mann-Whitney U-tests, and Kruskal-Wallis H-tests, among others. There is flexibility in the topics depending on student interest!
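
As a preview, here is a minimal R sketch of two of the tests named above, run in base R on simulated data; the distributions and group sizes are invented for illustration.

```r
set.seed(1)
# Two skewed samples, where the usual t-test assumptions are shaky.
x <- rexp(30, rate = 1)
y <- rexp(30, rate = 0.5)

# Mann-Whitney U test (wilcox.test on two samples): does one group tend
# to produce larger values than the other?
wilcox.test(x, y)

# Kruskal-Wallis H test: the rank-based analogue of one-way ANOVA,
# comparing three or more groups.
g <- factor(rep(c("a", "b", "c"), each = 20))
v <- c(rexp(20, 1), rexp(20, 0.7), rexp(20, 0.5))
kruskal.test(v ~ g)
```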