Autumn 2021 Projects


Alex Ziyu Jiang: Clustering and music genre classification

Student: Yitong (Eva) Shan
Slides | Writeup
Prerequisites: Knowledge of probability at the level of Stat 311 or beyond; some coding experience, preferably in Python or R; it will be fantastic if you also happen to like listening to music ;)

Have you ever been amazed by the sheer number of music genres in your Spotify or Apple Music app and wanted to understand their differences in a quantitative way? In this project you will learn how to process audio data and use some interesting clustering techniques from machine learning to classify songs into different genres.



Anna Neufeld: Multiple Testing

Student: Cathy Qi
Slides | Writeup
Prerequisites: Stat 311 and some knowledge of R will be helpful, but not required.

In an introductory statistics course, you learn how to obtain a p-value to test a single null hypothesis. These p-values are constructed such that, when the null hypothesis is true, you will make a mistake and reject the null only 5% of the time. In the real world, scientists often wish to test thousands of null hypotheses at once. In this setting, making a mistake on 5% of the hypotheses could lead to a very high number of false discoveries. Multiple testing techniques aim to limit the number of mistakes made over a large set of hypotheses without sacrificing too much power. We will start with a review of hypothesis testing, then discuss the challenges posed by large numbers of hypotheses, and finally learn about modern multiple testing techniques. Towards the end of the quarter, we will apply the techniques we learned to real data.
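As a small taste of these techniques, here is a minimal Python sketch (an illustration only, not part of the project materials) of the Benjamini-Hochberg procedure, which controls the false discovery rate at level alpha across m hypotheses:

```python
# A minimal sketch of the Benjamini-Hochberg procedure: sort the p-values,
# find the largest rank k with p_(k) <= (k / m) * alpha, and reject the
# hypotheses with the k smallest p-values.
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of the hypotheses rejected by BH at level alpha."""
    m = len(pvalues)
    # Sort p-values while remembering their original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest p-values.
    return sorted(order[:k_max])
```

Note that the step-up thresholds grow with the rank, so BH rejects more hypotheses than a Bonferroni correction while still limiting the expected fraction of false discoveries.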



David Marcano and Daniel Suen: Cluster Analysis

Student: Townson Cocke
Slides | Writeup
Student: Renee Chien
Slides | Writeup
Prerequisites: Basic knowledge of R or Python, statistical background equivalent to STAT 311 is recommended

In many real-world data applications, from medicine to finance, it is of interest to find groups within the data. Clustering is an unsupervised learning approach for separating data into representative groups. How to find and assess the quality of these discovered clusters is a vast area of modern research. In this project, we will survey several popular clustering techniques and apply them to simulated and real datasets. In particular, we will explore center-based approaches such as the k-means algorithm, dissimilarity-based approaches such as hierarchical clustering, probability-based approaches such as mixture models, and other techniques based on student interest. We will also look at how to assess a given clustering. The topics covered and their depth will develop based on the interest and statistical/mathematical level of the student. We are happy to take two students if more than one person is interested in this project.
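To give a flavor of the center-based approach, here is a minimal Python sketch (an illustration only, restricted to 1-D data for simplicity) of Lloyd's algorithm for k-means:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm for 1-D data: alternate assignment and mean updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[j].append(x)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)
```

Each iteration can only decrease the within-cluster sum of squares, which is why the alternation converges; the result still depends on the random initialization, one of the assessment issues the project will discuss.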



Kenny Zhang: Basics of Causal Inference

Student: Qiguang Yan
Slides | Writeup
Prerequisites: STAT 311-level statistics; some familiarity with regression is a plus.

The maxim “correlation is not causation” long prevented statisticians from answering questions like “Does smoking cause lung cancer?”. However, with the tools of causal inference and the emergence of big data, we are now able to answer some of these questions on a firm scientific basis. We can use causal inference to study a variety of topics, including vaccination, genetics, and more.



Michael Pearce: Voting, Ranking, and Preference Modeling

Student: Carolina Sawyer
Slides | Writeup
Prerequisites: Stat 311 or equivalent

Preference data appears in many forms: voters deciding between candidates in an election, movie critics rating new releases, and search engines ranking web pages, to name a few! However, modeling preferences in a statistical manner can be challenging for a variety of reasons, such as computational difficulties in working with discrete and high-dimensional data. In this project, we will study a variety of models used for preference data, including both ranking and scoring models. Understanding the challenges and uncertainty involved in aggregating preferences will be a key focus. Together, we will also carry out an applied project on preference data based on the student’s interests.
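As one simple example of aggregating rankings, here is a minimal Python sketch (an illustration only, not a model from the project reading list) of the Borda count, which turns each voter's ranking into scores and sums them:

```python
# A minimal sketch of the Borda count: a candidate ranked in position r
# (0-indexed) among m candidates earns m - 1 - r points from that voter;
# candidates are returned in order of decreasing total score.
def borda_count(rankings):
    """Aggregate a list of rankings (best first) into a consensus order."""
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (m - 1 - position)
    return sorted(scores, key=scores.get, reverse=True)
```

Even this simple rule illustrates the aggregation difficulties the project studies: different reasonable rules (Borda, pairwise majority, and so on) can disagree on the same set of ballots.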



Nick Irons: Introduction to Bayesian Data Analysis

Student: Xuweiyi (William) Chen
Slides | Writeup
Prerequisites: Knowledge of expectations and probability distributions at the level of STAT 340-341 and some knowledge of R.

Bayesian statistics is a method of modeling data that synthesizes our prior beliefs about the data with the information contained in the sample to estimate model parameters. Rather than a single point estimate of a parameter, the output of a Bayesian model is a “posterior” distribution which captures the uncertainty in our inferences. Bayesian methods are at the heart of many modern data science and machine learning techniques. In this introduction to Bayesian statistics we will cover conditional distributions, Bayes’ theorem, basics of Bayesian modeling, conjugate priors, MCMC sampling, and application to real dataset(s) of interest to the student in R. If time permits, possible further directions include hypothesis testing, linear regression, hierarchical models, and the EM algorithm for missing data. The goal of this project is to come away with an understanding of the basic conceptual and technical aspects of Bayesian inference and to get our hands dirty with real and interesting data. Possible data applications include estimating (potentially waning) COVID vaccine efficacy, estimating COVID prevalence over time in Washington state, or any other dataset of interest to the student.
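The idea of a conjugate prior can be shown in a few lines. Here is a minimal Python sketch (an illustration only; the project will work with real data in R) of Beta-Binomial updating, where a Beta prior on a success probability stays Beta after observing Bernoulli trials:

```python
# A minimal sketch of conjugate Bayesian updating: with a Beta(a, b) prior
# on a success probability p and k successes in n trials, the posterior is
# Beta(a + k, b + n - k), so updating is just parameter arithmetic.
def beta_binomial_update(a, b, k, n):
    """Return the posterior Beta parameters and the posterior mean of p."""
    a_post, b_post = a + k, b + n - k
    return a_post, b_post, a_post / (a_post + b_post)
```

The posterior mean (a + k) / (a + b + n) blends the prior mean with the sample proportion k / n, which is exactly the synthesis of prior belief and data described above; non-conjugate models require the MCMC sampling the project will cover.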



Seth Temple: Statistical Genetics I, Pedigrees and Relatedness

Student: Michael Yung
Slides | Writeup
Prerequisites: STAT 311; some programming experience preferred

We will explore statistical theory and methodology as they apply to the study of (human) heredity. The overarching themes of the readings are (1) computing measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees, and (2) estimating relatedness given dense SNP or whole-genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm for inferring allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for one or two meetings we will go through brief hands-on labs using current research software. The final project may involve estimating familial relationships among individuals in the 1000 Genomes database and comparing outputs across various statistical software packages.
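As a preview of theme (1), here is a minimal Python sketch (an illustration only, not code from the monograph) of computing kinship coefficients recursively from a known family tree:

```python
# A minimal sketch of pedigree kinship: the kinship of i with itself is
# (1 + f_i) / 2, where the inbreeding coefficient f_i is the kinship of
# i's parents, and the kinship of i with an earlier individual j is the
# average of the kinships of i's parents with j.
def kinship(ped):
    """ped maps each id to (father, mother), with None for founders;
    ids must be ordered so parents appear before their children.
    Returns a lookup function over pairs of ids."""
    ids = list(ped)
    pos = {i: n for n, i in enumerate(ids)}
    phi = {}

    def get(i, j):
        # Kinship is symmetric; founders' unknown parents contribute 0.
        if i is None or j is None:
            return 0.0
        return phi[(i, j)] if (i, j) in phi else phi[(j, i)]

    for i in ids:
        fa, mo = ped[i]
        # Self-kinship: 1/2 * (1 + inbreeding coefficient of i).
        phi[(i, i)] = 0.5 * (1.0 + get(fa, mo))
        # Kinship with each earlier individual j.
        for j in ids:
            if pos[j] >= pos[i]:
                break
            phi[(i, j)] = 0.5 * (get(fa, j) + get(mo, j))
    return get
```

For two full siblings with unrelated parents this recursion gives the familiar kinship of 1/4, the same value Wright's path counting formula produces by summing (1/2)^(n+1) over paths through common ancestors.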



Steven Wilkins-Reeves: An Introduction to Causal Inference and Sensitivity Analysis

Student: Hadi Nazirool Bin Yusri
Slides | Writeup
Prerequisites: Stat 311 (would be useful to have familiarity with linear regression)

Randomized controlled trials are often called the “gold standard” for assessing the effect of a treatment on an outcome. However, for many scientific questions, a randomized controlled trial may be either unethical (e.g., you can’t force someone to smoke to figure out whether it causes cancer) or downright impossible (e.g., you can’t assign someone a higher birth weight). Techniques from causal inference can help us estimate these treatment effects using only observational data and some identifying assumptions. Sensitivity analysis can tell us how robust our conclusions are to violations of those assumptions. In this project, you will read parts of Causal Inference in Statistics: A Primer by Judea Pearl, as well as some papers on the topic. A final project may involve analyzing an observational dataset of your choice, applying causal inference and sensitivity analysis techniques.
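One of the identification tools covered in Pearl's primer, backdoor adjustment, is easy to sketch in code. Here is a minimal Python illustration (with a made-up data layout, not code from the reading) for a single observed confounder Z:

```python
from collections import defaultdict

# A toy sketch of the backdoor adjustment (standardization) formula with
# one observed confounder Z:
#   E[Y | do(T = t)] = sum_z P(Z = z) * E[Y | T = t, Z = z].
# Each record is a dict with hypothetical keys "T", "Z", "Y".
def adjusted_mean(data, t):
    """Standardize the stratum means E[Y | T=t, Z=z] over the marginal of Z."""
    z_counts = defaultdict(int)
    stratum = defaultdict(lambda: [0.0, 0])  # z -> [sum of Y, count] among T = t
    for rec in data:
        z_counts[rec["Z"]] += 1
        if rec["T"] == t:
            stratum[rec["Z"]][0] += rec["Y"]
            stratum[rec["Z"]][1] += 1
    n = len(data)
    return sum(c / n * stratum[z][0] / stratum[z][1] for z, c in z_counts.items())
```

The difference adjusted_mean(data, 1) - adjusted_mean(data, 0) then estimates the average treatment effect, valid only under the identifying assumption that Z blocks all backdoor paths; sensitivity analysis asks how the answer moves when that assumption fails.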



Vydhourie R.T. Thiyageswaran: Graph Clustering

Student: Dawei Wang
Slides | Writeup
Prerequisites: Introductory Linear Algebra (and interest in basic introductory graph theory would be helpful)

We will explore clustering methods for graphs, focusing on k-means clustering and spectral clustering. Additionally, we will spend some time on applications by looking at studies featured in statistical blogs such as FiveThirtyEight. If there’s interest, we can look into replicating and extending some of the ideas in these studies.
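To preview the spectral side, here is a minimal Python sketch (an illustration only, assuming NumPy is available) of splitting a graph into two clusters using the Fiedler vector of the graph Laplacian:

```python
import numpy as np

# A minimal sketch of spectral bipartitioning: form the unnormalized graph
# Laplacian L = D - A and split the vertices by the sign of the Fiedler
# vector, the eigenvector for the second-smallest eigenvalue of L.
def spectral_bipartition(A):
    """Return 0/1 cluster labels for a symmetric adjacency matrix A."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)
```

This is where the linear algebra prerequisite comes in: the number of zero eigenvalues of L counts connected components, and the Fiedler vector relaxes the combinatorial problem of finding a minimal cut into an eigenvector computation. (Full spectral clustering into k groups runs k-means on the first k eigenvectors.)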