Winter 2022 Projects

Anna Neufeld: Introduction to Clinical Trials

Student: Hisham Bhatti

Slides | Writeup

Prerequisites: None.

Drawing mainly from the textbook “Fundamentals of Clinical Trials” by Friedman et al., we will learn about the design and analysis of clinical trials, with special attention to statistical considerations and the role of statisticians. Depending on the student's interest, for the final project we will either delve into an advanced statistical topic in clinical trials or do a “case study” in which we learn about a recent or current clinical trial in depth.

Drew Wise: Introduction to Nonparametric Statistics

Student: Xinyi (Vicky) Xiang

Slides | Writeup

Prerequisites: An introductory statistics class is all that's needed. Some programming experience would be a plus.

Many of the methods studied in an introductory statistics class — z-scores and t-tests, for example — rely on assumptions not always met by the data. The purpose of this project is to expose the student to nonparametric statistical tests, a class of techniques that are more broadly applicable. We will begin by discussing the advantages and disadvantages of nonparametric tests, and then we will study the tests themselves: Wilcoxon signed-rank tests, Mann-Whitney U-tests, and Kruskal-Wallis H-tests, among others. There is flexibility in the topics depending on student interest!
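As a rough illustration of one of the tests named above, the Mann-Whitney U statistic can be computed by direct pair counting. This is a minimal sketch, not a full test: the function name `mann_whitney_u` is hypothetical, ties are assumed to count as one half, and no p-value is computed.

```python
def mann_whitney_u(x, y):
    """Return the U statistic for sample x relative to sample y.

    U counts the pairs (x_i, y_j) with x_i > y_j; tied pairs count 1/2.
    Only the ordering of the values is used, so no distributional
    assumption such as normality is needed.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


if __name__ == "__main__":
    group_a = [1.1, 2.3, 3.0]
    group_b = [4.2, 5.5]
    # Every value in group_a is below every value in group_b, so U is 0.
    print(mann_whitney_u(group_a, group_b))
```

The two U statistics always sum to the number of pairs, `len(x) * len(y)`, which is a handy sanity check when coding the test by hand.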

Erin Lipman: Bayesian perspectives on statistical modeling

Student: Zhengyang (Anthony) Xu

Slides | Writeup

Prerequisites: Some familiarity with multivariate linear regression will be helpful, as will some familiarity with R. Our project can be either more technical or more conceptual depending on the background and interests of the student.

Many of the methods we focus on in introductory statistics courses, for example confidence intervals and null hypothesis significance testing, come from the “Frequentist” philosophy of statistics. There is another, increasingly popular, philosophy of statistics called “Bayesian” statistics which has its own ways of conceptualizing and analyzing data. Bayesian statistics views parameters in the world (such as the effect of a medical treatment) as random variables rather than as fixed numbers, and it focuses on synthesizing prior evidence about the distribution of a parameter with information contained in the data. The goal of this project is to gain familiarity with statistical modeling from the Bayesian perspective.

Jess Kunke: Survey statistics and R

Student: Mekias Kebede

Slides | Writeup

Prerequisites: The project can be tailored based on the student's background knowledge; some prior exposure to concepts such as mean, variance, and probability would be helpful.

How do you analyze survey data? How do you design a survey to address a research question and account for uncertainty in the process? What goes into designing, conducting and analyzing big government surveys like the census? What kinds of surveys are there? These are some of the questions we can explore together. We can learn about some of the approaches to designing and analyzing surveys, and we can pick a data set to analyze. The exact direction can be tailored based on student interest and background.

Medha Agarwal: Statistical Simulations

Student: Evana Sorfina Mohd Nazri

Slides | Writeup

Prerequisites: STAT 311, programming experience (preferably in R/Python)

This project aims to explore various methods of statistical simulation, covering both their theoretical underpinnings and their practical use. We will cover methods of obtaining independent and identically distributed random samples for both continuous and discrete random variables. These include inverse transform, accept-reject, ratio of uniforms, and importance sampling, among others. During the later parts of the project, we will delve into Markov chain Monte Carlo (MCMC), a general-purpose method of obtaining correlated random samples from a wide range of probability distributions. While MCMC is a rich area in itself (reading about it is highly encouraged), we will cover the two most popular MCMC algorithms: Metropolis-Hastings and Gibbs sampling. Since simulation is a very programming-centric topic, the project will regularly involve coding the sampling methods covered. These will be short programs for toy examples and will not require advanced programming skills.
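To give a flavor of the toy examples, here is a minimal sketch of two of the methods mentioned above: inverse transform sampling for an exponential distribution, and a random-walk Metropolis sampler (the symmetric-proposal special case of Metropolis-Hastings) targeting a standard normal. The function names, the target distributions, and the step size are illustrative assumptions, not part of the project materials.

```python
import math
import random


def sample_exponential(rate, n, rng):
    """Inverse transform sampling for Exponential(rate).

    If U ~ Uniform(0, 1), then -ln(1 - U) / rate has the
    Exponential(rate) distribution, giving i.i.d. draws.
    """
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]


def metropolis_normal(n, step, rng, start=0.0):
    """Random-walk Metropolis chain targeting a standard normal.

    Proposals are symmetric (Gaussian steps), so the acceptance ratio
    reduces to target(proposal) / target(current). The draws are
    correlated, unlike the i.i.d. draws from inverse transform.
    """
    def log_target(x):
        # Log density of N(0, 1) up to an additive constant.
        return -0.5 * x * x

    x = start
    chain = []
    for _ in range(n):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, ratio), compared on the log scale.
        if math.log(1.0 - rng.random()) < log_ratio:
            x = proposal
        chain.append(x)
    return chain


if __name__ == "__main__":
    rng = random.Random(0)
    draws = sample_exponential(2.0, 10000, rng)
    print("exponential sample mean:", sum(draws) / len(draws))
    chain = metropolis_normal(10000, 1.0, rng)
    print("Metropolis chain mean:", sum(chain) / len(chain))
```

With a rate of 2 the exponential sample mean should land near 0.5, and the chain mean near 0, up to Monte Carlo error.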

Michael Cunetta: Sabermetrics

Student: David Wang

Slides | Writeup

Prerequisites: Familiarity with the rules of major league baseball. Some familiarity with R.

We will read excerpts from “The Book: Playing the Percentages in Baseball” (2007) and carry out our own inference (in R) using baseball datasets. By the end of the project, we will understand core sabermetric principles, we will be critical consumers of baseball analysis, and we will be able to ask and answer our own baseball-related research questions. In April, the student and mentor will go on a field trip to T-Mobile Park to cheer on the Mariners.

Nick Irons: Nick Irons

Student: Qianqian (Emma) Yu

Slides | Writeup

Prerequisites: Knowledge of probability at the level of STAT 311 and some familiarity with programming.

Bayesian statistics is a method of modeling data that synthesizes our prior beliefs about the data with the information contained in the sample to estimate model parameters. Rather than a single point estimate of a parameter, the output of a Bayesian model is a “posterior” distribution which captures the uncertainty in our inferences. Bayesian methods are at the heart of many modern data science and machine learning techniques. In this introduction to Bayesian statistics we will cover conditional distributions, Bayes’ theorem, basics of Bayesian modeling, conjugate priors, MCMC sampling, and application to real dataset(s) of interest to the student in R. If time permits, possible further directions include hypothesis testing, linear regression, hierarchical models, Latent Dirichlet Allocation, and the EM algorithm for missing data. The goal of this project is to come away with an understanding of the basic conceptual and technical aspects of Bayesian inference and to get our hands dirty with real and interesting data. Possible data applications include estimating (potentially waning) COVID vaccine efficacy, estimating COVID prevalence over time in Washington state, topic modeling in NLP, or any other dataset of interest to the student.
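As a small illustration of the conjugate priors mentioned above, the sketch below updates a Beta prior on a success probability with binomial data. The function names and the example counts are hypothetical and chosen only for illustration.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior parameters for a Beta(alpha, beta) prior on a success
    probability after observing binomial data.

    The Beta family is conjugate to the binomial likelihood, so the
    posterior is again a Beta distribution: the observed counts are
    simply added to the prior parameters.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


if __name__ == "__main__":
    # Uniform prior Beta(1, 1), then 7 successes and 3 failures (made-up data).
    post_alpha, post_beta = beta_binomial_update(1.0, 1.0, 7, 3)
    print(post_alpha, post_beta)             # 8.0 4.0
    print(beta_mean(post_alpha, post_beta))  # 8 / 12, about 0.667
```

The posterior mean of about 0.667 sits between the prior mean (0.5) and the sample proportion (0.7), which is the "synthesis" of prior belief and data in its simplest form.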

Nina Galanter: Optimal Treatment Rules: Causal Inference and Statistical Learning

Student: Leah Jia

Slides | Writeup

Prerequisites: Some familiarity with conditional probability, linear regression, and R.

In many biomedical and public health applications of statistics we are interested in determining the best treatment. However, people and their specific situations will vary and in some cases one treatment does not fit all! Instead, we can create a treatment rule which will take in a subject and their variables and predict the best treatment for them. We want to predict this as well as possible, and so we are looking for “optimal” rules. Optimal treatment rules involve both causal inference and statistical learning as we create rules based on estimated treatment effects. This project will first go over causal inference foundations and then explore Q-learning methods for treatment rules, which might include regression, penalized regression, and support vector machines depending on time and the student’s background. We will use R to evaluate the methods with simulated and real data. If there is extra time, we could look into classification-based methods or dynamic treatment regimes.

Sarah Teichman: Multivariate Data Analysis

Student: Huong Ngo

Slides | Writeup

Prerequisites: STAT 311. Linear algebra would be helpful but is not necessary.

Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook “An Introduction to Applied Multivariate Analysis with R” to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.

Seth Temple: Statistical Genetics I: Pedigrees and Relatedness

Student: Saleh Wehelie

Slides | Writeup

Prerequisites: STAT 311, and some programming experience

We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or entire genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for 1 or 2 meetings we will go through brief hands-on labs using current research software. More details on this recurring DRP may be found here: https://sdtemple.github.io/statgen1.
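For readers curious about the ABO example mentioned above, here is a minimal sketch of the expectation-maximization idea, assuming Hardy-Weinberg equilibrium and the standard three-allele model (A, B, O). It is not taken from the monograph; the function name `abo_em`, the starting values, and the phenotype counts are illustrative assumptions.

```python
def abo_em(n_a, n_b, n_ab, n_o, iterations=200):
    """Estimate ABO allele frequencies (p, q, r) for alleles A, B, O
    from phenotype counts by EM, assuming Hardy-Weinberg equilibrium.

    Blood types A and B each hide two genotypes (AA or AO, BB or BO),
    so the genotype counts are the missing data.
    """
    n = n_a + n_b + n_ab + n_o
    p = q = r = 1.0 / 3.0  # arbitrary starting values

    def homozygous_share(freq, r_freq):
        # P(homozygote | phenotype) = f^2 / (f^2 + 2 f r) = f / (f + 2 r).
        denom = freq + 2.0 * r_freq
        return freq / denom if denom > 0 else 0.0

    for _ in range(iterations):
        # E-step: split the A and B phenotype counts into expected genotypes.
        n_aa = n_a * homozygous_share(p, r)
        n_ao = n_a - n_aa
        n_bb = n_b * homozygous_share(q, r)
        n_bo = n_b - n_bb
        # M-step: allele frequencies by gene counting over 2n alleles.
        p = (2.0 * n_aa + n_ao + n_ab) / (2.0 * n)
        q = (2.0 * n_bb + n_bo + n_ab) / (2.0 * n)
        r = (n_ao + n_bo + 2.0 * n_o) / (2.0 * n)
    return p, q, r


if __name__ == "__main__":
    # Made-up counts equal to the Hardy-Weinberg expectations for
    # p = 0.3, q = 0.1, r = 0.6 in a sample of 10,000 people.
    print(abo_em(4500, 1300, 600, 3600))
```

Because the made-up counts match the Hardy-Weinberg expectations exactly, the estimates should settle at approximately (0.3, 0.1, 0.6); with real data the fit would only be approximate.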