Winter 2023
Ethan Ancell: Statistics in Neuroscience
Student: David Ye
Prerequisites:
Students should have an understanding of hypothesis testing, as well as familiarity with R and RStudio.
Neuroscience is a fascinating and rapidly moving field enabling us to better understand how the brain works. In the quest for understanding the brain, neuroscientists use special technology in experimental trials to track neuron behavior across time, and pair this data with events occurring during the experiment. Because there is so much data generated from these trials, there are lots of fascinating statistical questions to be answered when analyzing this data. In this directed reading project, students will analyze an example dataset from a real neuroscience experimental trial conducted here at UW to try and answer whether the neurons in a mouse are actually responding to external stimuli in the experiment. Broadly speaking, this directed reading program project will be an excellent opportunity for undergraduate students to try their hands at a real application of statistics in neuroscience, as well as learn about some of the difficulties of conducting hypothesis tests in environments where certain assumptions of classical hypothesis tests are not fully met.
Alex Bank: Cutting-Edge Sports Statistics
Student: Luke VanHouten
Prerequisites:
Experienced with Python or R
Have you ever wondered how an NBA player creates space for their shot? Or how an MLB slugger knows when to crush a fastball? Or maybe you wonder how a soccer player decides where to run when they are away from the ball? This project will be driven by the student and will explore cutting-edge models being used in sports statistics. We will select a research paper from the Sloan Sports Conference and study the data and math behind the models used in the paper. We will apply the techniques we studied to implement our own model that answers a question of interest. Potential directions include (but are not limited to) spatial models for player positioning, optimizing shot selection, projecting top draft picks, and identifying inefficiencies in Vegas lines. Students who are interested in this project should look through the conference papers from various years at the link below.
https://www.sloansportsconference.com/conference/2022-conference#research-papers
Andrea Boskovic: Proportional Hazards Models
Student: Dante Ramirez
Prerequisites:
Some experience in survival analysis
Researchers in biomedical fields are often interested in the time it takes for a particular outcome of interest to occur, e.g., time to death. Survival models, which relate the time that passes before an event occurs to some covariates, can be used to answer these questions. In this project, we will investigate a specific type of survival model: proportional hazards models, where a unit increase in a given covariate has a multiplicative effect on the hazard rate.
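As a hedged illustration of what "multiplicative" means here, assume the standard Cox form of a proportional hazards model. The symbols h_0 (baseline hazard), x (covariates), and \beta (coefficients) are notation introduced for this sketch and do not come from the project description:

```latex
h(t \mid x) = h_0(t)\,\exp(\beta^\top x),
\qquad
\frac{h(t \mid x_j + 1,\, x_{-j})}{h(t \mid x_j,\, x_{-j})} = \exp(\beta_j).
```

Under this assumption, raising covariate x_j by one unit while holding the others fixed multiplies the hazard by \exp(\beta_j) at every time t, which is why the hazards are called proportional.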
Vydhourie R.T. Thiyageswaran: Random walks on graphs
Student: Noah McMahon
Prerequisites:
Some basic exposure to probability. We would still properly review basic probability.
We would study what a random walk is, followed by a little bit of graph theory. Finally, we would go over some examples where thinking about random walks on graphs has been an interesting approach to solving more general problems.
Antonio Olivas: Statistical evaluation of medical tests for classification and prediction
Student: Sephora-Clotilde Zoro
Prerequisites:
None
In medicine, there exist many medical tests for diagnosing a disease or for learning about an individual's prognosis once a diagnosis has been established. However, how do we know how accurately those tests diagnose the diseases they are supposed to diagnose? Also, when there is more than one diagnostic test for the same disease, how do we know which one is better? Moreover, when the diagnostic test corresponds to a continuous variable, how do we know the threshold to differentiate between having or not having the disease?
In this project, we will learn how to evaluate the performance of continuous medical tests using the receiver operating characteristic (ROC) curve. The ROC curve is very popular in medicine because it conveys graphically the performance of the test. Using properties of the ROC curve, we will learn different ways of comparing two or more medical tests, and different ways of choosing the optimal threshold based on the condition of interest.
If time permits and depending on the student's interests, we can also study how to evaluate the performance of a continuous medical test when the performance and optimal threshold depend on other individual characteristics such as age and sex.
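The ROC curve and threshold choice described above can be sketched in a few lines. This is a minimal illustration on made-up toy data, not the project's method: the function names are hypothetical, and Youden's index (sensitivity minus false positive rate) is just one common criterion assumed here for picking a threshold.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (thresholds, fpr, tpr), calling a subject 'diseased' when score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)[::-1]  # candidate cutoffs, from high to low
    # True positive rate: fraction of diseased subjects at or above each cutoff.
    tpr = np.array([(scores[labels] >= t).mean() for t in thresholds])
    # False positive rate: fraction of healthy subjects at or above each cutoff.
    fpr = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    return thresholds, fpr, tpr

def youden_threshold(scores, labels):
    """Pick the cutoff maximizing Youden's index J = TPR - FPR (an assumed criterion)."""
    thresholds, fpr, tpr = roc_points(scores, labels)
    return thresholds[np.argmax(tpr - fpr)]

# Toy data (fabricated for illustration): test values and true disease status.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
thresholds, fpr, tpr = roc_points(scores, labels)
best = youden_threshold(scores, labels)
```

Plotting `fpr` against `tpr` gives the ROC curve; other criteria, such as weighting false negatives more heavily, would generally select a different threshold.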
Charlie Wolock: Introduction to prediction
Student: Liuyixin Shao
Prerequisites:
Basic familiarity with R, introductory statistics. Some knowledge of regression would be useful.
Many classical statistical methods are focused on learning associations between variables. However, we may also be interested in prediction --- making a guess about an unknown or future outcome on the basis of whatever information we have access to. In this project, we'll learn about the unique challenges of prediction. We'll discuss how to use traditional statistical methods to make predictions and start to explore more modern machine learning techniques. This project will have a strong focus on thoughtful construction and evaluation of prediction models. We will identify an interesting dataset and implement some of our own prediction procedures using R.
Nina Galanter: Introduction to Survival Analysis
Student: Hannah Chiu
Prerequisites:
Some knowledge of R or another programming language, understanding of expected value and conditional probability, some familiarity with linear regression
In medicine and public health, we are often interested in answering questions about the time until an event occurs. For example, what is the median recovery time from some surgery? Or: does a treatment prolong the time until death for patients with a particular cancer? Because of this, Survival Analysis, which works with these time-to-event outcomes, is an important area of Biostatistics. Most time-to-event data is censored - we cannot observe the event for everyone because we lose track of some subjects or something else happens to them. In this project, we will learn about survival analysis methods for censored data, including Kaplan-Meier curves, the Logrank test, and Cox regression. We may cover other topics based on time and student interest. This project will culminate in either a real data analysis using a dataset of the student's choice or a simulation study.
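To make the idea of censoring concrete, here is a minimal sketch of the Kaplan-Meier estimator on a made-up dataset. The function name and the toy data are assumptions for illustration; in practice one would likely use an established survival analysis package.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  observed follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if the subject was censored
    Returns a list of (event_time, estimated_survival_probability).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= removed
    return curve

# Toy data (fabricated): five subjects, two of them censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(times, events)
```

Censored subjects never trigger a drop in the curve, but they still count in the risk set up to the time they were last seen, which is how the estimator uses their partial information.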
Autumn 2022
Vydhourie R T Thiyageswaran: Stellaris Project
Student: Gaunyi (Victor) He
Prerequisites:
Strong programming skills in Python. An interest in games and networks could be useful.
The project will mainly be coding to simulate the process of players in a game on a graph. Here is a more detailed description of the project: https://www.stat.berkeley.edu/~aldous/Research/Stel_project/stellaris_project.html
Yikun Zhang: Introduction to Density-based Clustering and its Applications
Student: Dongfeng Li
Prerequisites:
STAT 311 or STAT 340 or equivalent (knowledge of basic probability and statistics), some familiarity with programming in Python or R, etc.
In many, if not most, practical applications, the available observations do not spread evenly over the data space but are instead grouped into several clusters. This project is designed to investigate how to statistically uncover these clusters from observational (point cloud) data through density-based approaches. Such approaches, unlike hierarchical clustering and other dissimilarity-based methods, leverage the (estimated) density from the data to define the clusters and do not require any dissimilarity metric in the clustering process. Among the family of density-based clustering approaches, we are planning to focus on mode clustering, during which the kernel density estimator and mean shift algorithm will be reviewed and discussed. Theoretically, we may study the consistency of mode clustering and its connection to the EM algorithm. Practically, we may apply mode clustering to real-world data and present some interesting scientific analyses. Depending on the student's interest, the project can be either theory-oriented or coding-focused. We are also happy to survey more density-based clustering approaches such as DBSCAN or other clustering methods beyond the density-based domain according to any additional request from the student.
Zhaoqi Li: Introduction to Adaptive Experimental Design
Student: Zilin Huang
Prerequisites:
Either some mathematical maturity at the level of STAT 394, or some familiarity with Python.
Suppose you are in Vegas facing three lottery machines, each with a different probability of winning a prize. You would like to figure out which one wins the most, so you try out these machines. After trying them many times, you start thinking about strategies: should I find the lottery machine that has the highest probability of winning the prize and keep playing that machine, or should I find the best way to play so that I lose the least amount of money in 100 rounds? Surprisingly, these two strategies lead to different answers, and they correspond to two branches of multi-armed bandits. This field has direct applications at large tech companies like Amazon, Google, and Meta, and connects statistics, computer science, and economics. In this project, we will first review some well-known approaches in multi-armed bandits, and then either give a broad overview of the latest approaches for adaptive experimental design or conduct some experiments to visualize the power of these methods, depending on the student's background.
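The three-machine setting above can be simulated with a very simple strategy. The sketch below uses epsilon-greedy, which is an assumption chosen for brevity and is not necessarily one of the approaches covered in the project; the function name, win probabilities, and parameter values are all hypothetical.

```python
import random

def epsilon_greedy(win_probs, rounds=1000, epsilon=0.1, seed=0):
    """Play a bandit with the given (unknown to the player) win probabilities.

    With probability epsilon, or while some machine is still untried, pick a
    machine at random (explore); otherwise play the machine with the best
    observed win rate so far (exploit).
    Returns (pulls, wins) per machine.
    """
    rng = random.Random(seed)
    pulls = [0] * len(win_probs)
    wins = [0] * len(win_probs)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            arm = rng.randrange(len(win_probs))      # explore
        else:
            rates = [w / n for w, n in zip(wins, pulls)]
            arm = rates.index(max(rates))            # exploit
        pulls[arm] += 1
        wins[arm] += rng.random() < win_probs[arm]   # True counts as 1
    return pulls, wins

# Hypothetical machines with win probabilities 0.2, 0.5, and 0.7.
pulls, wins = epsilon_greedy([0.2, 0.5, 0.7])
```

This strategy targets the second goal in the description (losing little money while playing); identifying the single best machine with high confidence would typically call for spreading pulls differently.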
Apara Venkat: Introduction to Causal Discovery
Student: Mandy Zhang
Prerequisites:
Knowledge about probability distributions, conditional independence. Programming experience would be nice, but not required.
In this project, we will take a graphical approach to learn causal relationships between different variables in a system. First, we will learn how to represent causality using Directed Acyclic Graphs (DAGs). We will cover concepts such as d-separation, Markov property, and faithfulness. We will then describe two algorithms to learn causality from observational data. The first is a constraint-based algorithm called PC (named after Peter Spirtes and Clark Glymour who first described it). The second is a score-based algorithm called Greedy Equivalence Search (GES). Then, we will find a real dataset to apply these methods. If time permits, we can explore other ideas such as computational complexity, causal sufficiency, and background knowledge.
Erin Lipman: Bayesian perspectives on probability and statistics
Student: Jennie Jeon
Prerequisites:
Probability at the level of 311, and some programming experience (preferably R)
Many of the methods we focus on in introductory statistics courses, for example confidence intervals and null hypothesis significance testing, come from the “Frequentist” philosophy of statistics which interprets probability as describing the relative frequency of a certain event over repeated trials (ex. if I flip a fair coin 100 times, about 50 of these flips will land on heads). "Bayesian" statistics on the other hand interprets probability as describing our belief and uncertainty about an event (ex. if I flip a coin once, it is equally likely to come up heads or tails). Because the Bayesian perspective views probability in terms of belief, it provides a rigorous framework for updating our belief in light of new data (ex. if I see that my coin lands on heads 100 out of 100 times, I might start to suspect that it is a fake coin where both sides are heads). In this DRP, we will learn how the Bayesian framework allows us to update our beliefs in light of new data and allows us to answer questions that we cannot answer within the frequentist perspective.
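The coin example above can be turned into a small calculation. The sketch assumes a Beta prior on the probability of heads, which is a standard conjugate choice but is an assumption of this illustration rather than something stated in the description; the function names are hypothetical.

```python
def beta_update(a, b, heads, tails):
    """Update a Beta(a, b) prior on P(heads) after observing coin flips.

    With a Beta prior and coin-flip data, the posterior is again a Beta
    distribution: Beta(a + heads, b + tails).
    """
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Assumed prior: Beta(1, 1), i.e. every value of P(heads) is equally plausible.
prior = (1, 1)
# Data from the example: the coin lands heads 100 times out of 100.
posterior = beta_update(*prior, heads=100, tails=0)
prior_mean = beta_mean(*prior)
posterior_mean = beta_mean(*posterior)
```

Before seeing data the expected probability of heads is 1/2; after 100 heads in a row it is 101/102, which matches the intuition that the coin is probably not fair.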
Antonio Olivas and Anand Hemmady: Introduction to Survival Analysis
Student: Bao Han Ngo
Student: Nathan Dennis
Prerequisites:
Familiarity with basic probability theory (random variables, distribution functions, expectation)
How can we understand and estimate the length of time that will elapse before some outcome of interest happens? This question is important for a wide range of applications, including (but certainly not limited to) problems in medicine and public health. To answer this question, we use tools from survival analysis. Analyzing survival data comes with a unique set of challenges that distinguish survival analysis from other fields of statistics. The most notable of these challenges is that survival data are often censored, meaning we can't see whether or when the event happened for some observations. We will first see the kind of problems that survival analysis can be used to address, with particular attention to problems involving censoring. We will then explore both parametric (e.g. MLE) and nonparametric (e.g. Kaplan-Meier) methods for handling these problems, contrasting these approaches and learning about the advantages and disadvantages of each. Depending on student interest, we may also talk about the Cox regression model. We also plan to see how to compare survival curves with parametric models and the log-rank test, and finally we will apply what we have learned to a particular problem (to be chosen in conjunction with the student).
Ellen Graham: Practice and Philosophy of Data Cleaning
Student: Joy Li
Prerequisites:
Basic experience with coding is a plus but not necessary
When doing applied statistics it is often necessary to "clean" data before analyzing them, but the details of cleaning data are often glossed over. However, the choices made during data cleaning can significantly impact the questions that cleaned data can answer. In this project, we'll discuss what it means to "clean" data and prepare it for the next stage of analysis. The project will vary based on student interest, but possible topics include: Frameworks and tools used in practice, ethics of data cleaning, common data structures, scaling tools to large datasets, missing data, and statistical considerations of choices made while cleaning.
Spring 2022
Andrea Boskovic and Harshil Desai: NBA Analytics and Machine Learning
Student: Kobe Sarausad
Student: Pranav Natarajan
Prerequisites:
Some experience in R or Python; some knowledge about basketball
Have you ever wondered how to predict which NBA rookie will become an all star or wondered how teams choose which players to draft? In this project, we will explore NBA data to make a model that predicts something related to basketball. We will start with an introduction to basic machine learning models, learn how to implement models in R or Python, and evaluate the models we've created. Potential directions could include (but are definitely not limited to) ranking players based on box scores and advanced stats, predicting who will be the MVP, or predicting a team's odds of making the playoffs in a given year. We are willing to mentor two students!
Nina Galanter: Optimal Treatment Rules: Causal Inference and Statistical Learning
Student: Max Bi
Prerequisites:
Some familiarity with conditional probability, linear regression, and R
In many biomedical and public health applications of statistics we are interested in determining the best treatment. However, people and their specific situations will vary and in some cases one treatment does not fit all! Instead, we can create a treatment rule which will take in a subject and their variables and predict the best treatment for them. Optimal treatment rules involve both causal inference and statistical learning as we create rules based on estimated treatment effects. This project will first go over causal inference foundations and then explore Q-learning methods for treatment rules, which might include regression, penalized regression, or generalized additive models depending on time and the student's background. We will use R to evaluate the methods with simulated data.
Anna Neufeld and Alan Min: Introduction to Computational Biology
Student: Wei Jun Tan
Student: Iris Zhou
Prerequisites:
Programming experience (preferably in R). Knowledge of probability distributions at the level of Math/Stat 394 or Stat 340 is preferred but not required.
Given massive amounts of data available from next generation genome sequencing, sequence alignment methods are necessary to align genomic reads to reference genomes. Alignment tools make it possible to identify genetic variation and mutation leading to biological discovery.
We plan to work with the textbook "Computational Genome Analysis," by Deonier, Waterman, and Tavare (available for free online). We will start with some background reading on necessary biological context, and then we will read about statistical concepts related to sequence alignment problems that are common in modern computational biology. After gaining this necessary background, we will learn about modern algorithms for sequence alignment. We are hoping to mentor two students!
Reading and Research Opportunity on Voting
Mentors: Prof. Elena Erosheva, Michael Pearce, Prof. Conor Mayo-Wilson
Students: Minghe (Mia) Zhang and Man (Terry) Yuan
Prerequisites:
Computational skills (R required; other knowledge and experience, e.g., with Python, is desirable). Preference given to Statistics and CSE majors and to candidates with interest and possibility to continue with the project in Summer and Fall 2022
In peer review settings, groups or panels of experts are tasked with evaluating submissions such as grant proposals or job candidate materials. For each submission, individual input is often given as a numeric score or a letter grade. The average or median of such scores is often used to summarize the collective opinion of a panel of experts. In this project, we will consider other ways to aggregate expert opinions by drawing a parallel between panel decisions and elections or voting.
All voting procedures have two key features: the types of input that are used and how these inputs are aggregated. Examples of voting procedures include majority rule, Borda rule, single transferable vote, and majority judgement. Voting procedures matter in that a choice of voting procedure can change panel outcomes or which candidate(s) or proposal(s) are preferred. Social choice theory demonstrates that (a) no voting procedure for selection of one out of three or more choices can satisfy simultaneously a small number of natural desiderata (this result is known as Arrow's Impossibility Theorem), that (b) every voting procedure satisfies some desiderata but not others, and that (c) election outcomes can differ depending on what voting system is used. Points (a)-(c) constitute compelling reasons in favor of better understanding the influence of aggregation methods on panel-level outcomes: we will critically assess properties of voting procedures and whether these properties should be required or desired in panel opinion aggregation methods used in peer review. The project will involve applying social choice algorithms (e.g., Borda rule and Majority Judgement) to de-identified data on panel grant peer review.
Antonio Olivas: Estimation for cancer screening models using deconvolution
Student: Jia Zeng
Prerequisites:
Calculus (MATH 126) and exposure to probability theory (STAT 340).
Cancer screening programs are an important component for secondary cancer prevention. To understand the conditions under which a cancer screening program provides the best benefit, mathematical models are used to estimate relevant quantities using information from cancer screening trials.
In the natural history of a cancer, the time to cancer onset (subclinical) and the sojourn/latent time (time between onset and clinical appearance) are two quantities of interest, but impossible to know separately. However, by using a screening tool we obtain some information that allows us to differentiate between these two components.
In this project we will study a mathematical model that uses information at the aggregated level from a cancer screening trial to estimate mean time to onset, mean sojourn time, and sensitivity of the screening test, via the deconvolution formula and maximum likelihood estimation.
Rrita Zejnullahi: Introduction to Human Rights Statistics
Student: Cindy Elder
Prerequisites:
Some exposure to survey sampling and regression analysis.
In this DRP project, we consider the application of statistical methodology to human rights. Topics include missing females, criminal justice, violence against women, hunger and poverty. By the end of the project, we will be able to describe ways that statistical methods can be applied to human rights problems and identify areas that need development of new methods. In the first half, we will read and discuss research papers. In the latter half, we will pick a paper to replicate, with the exact choice of topic at the student’s discretion. This project will be mostly remote (meetings via Zoom!)
Winter 2022
Medha Agarwal: Statistical Simulations
Student: Evana Sorfina Mohd Nazri
Prerequisites:
STAT 311, programming experience (preferably in R/Python)
This project aims to explore various methods of statistical simulation: their theoretical underpinnings and practical use. We will cover methods of obtaining independent and identically distributed random samples for both continuous and discrete random variables. These include methods like inverse transform, accept-reject, ratio of uniforms, importance sampling, etc. During the later parts of the project, we will delve into Markov chain Monte Carlo, a robust method of obtaining correlated random samples from any probability distribution. While MCMC is a rich area in itself (reading about it is highly encouraged), we will cover the two most popular MCMC algorithms: Metropolis-Hastings and Gibbs sampling. Since simulation is a very programming-centric topic, the project will regularly involve coding the sampling methods covered. These will be short programs for toy examples and will not require advanced programming skills.
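As an example of the kind of short toy program this involves, here is a minimal sketch of the inverse transform method for the exponential distribution. The choice of distribution, the function name, and the parameter values are assumptions made for illustration.

```python
import math
import random

def sample_exponential(rate, n, seed=0):
    """Draw n Exponential(rate) samples by inverse transform.

    The exponential CDF is F(x) = 1 - exp(-rate * x), so its inverse is
    F^{-1}(u) = -log(1 - u) / rate. Plugging uniform draws u into the
    inverse CDF yields samples from the target distribution.
    """
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is defined.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

samples = sample_exponential(rate=2.0, n=100_000)
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate = 0.5
```

The same recipe works for any distribution whose CDF can be inverted in closed form; when it cannot, methods such as accept-reject become useful.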
Michael Cunetta: Sabermetrics
Student: David Wang
Prerequisites:
Familiarity with the rules of major league baseball. Some familiarity with R.
We will read excerpts from "The Book: Playing the Percentages in Baseball" (2007) and carry out our own inference (in R) using baseball datasets. By the end of the project, we will understand core sabermetric principles, we will be critical consumers of baseball analysis, and we will be able to ask and answer our own baseball-related research questions. In April, the student and mentor will go on a field trip to T-Mobile Park to cheer on the Mariners.
Nina Galanter: Optimal Treatment Rules: Causal Inference and Statistical Learning
Student: Leah Jia
Prerequisites:
Some familiarity with conditional probability, linear regression, and R.
In many biomedical and public health applications of statistics we are interested in determining the best treatment. However, people and their specific situations will vary and in some cases one treatment does not fit all! Instead, we can create a treatment rule which will take in a subject and their variables and predict the best treatment for them. We want to predict this as well as possible, and so we are looking for "optimal" rules. Optimal treatment rules involve both causal inference and statistical learning as we create rules based on estimated treatment effects. This project will first go over causal inference foundations and then explore Q-learning methods for treatment rules, which might include regression, penalized regression, and support vector machines depending on time and the student's background. We will use R to evaluate the methods with simulated and real data. If there is extra time, we could look into classification-based methods or dynamic treatment regimes.
Jess Kunke: Survey statistics and R
Student: Mekias Kebede
Prerequisites:
The project can be tailored based on the student's background knowledge; some prior exposure to concepts such as mean, variance, and probability would be helpful.
How do you analyze survey data? How do you design a survey to address a research question and account for uncertainty in the process? What goes into designing, conducting and analyzing big government surveys like the census? What kinds of surveys are there? These are some of the questions we can explore together. We can learn about some of the approaches to designing and analyzing surveys, and we can pick a data set to analyze. The exact direction can be tailored based on student interest and background.
Nick Irons: Bayesian Data Analysis
Student: Qianqian (Emma) Yu
Prerequisites:
Knowledge of probability at the level of STAT 311 and some familiarity with programming.
Bayesian statistics is a method of modeling data that synthesizes our prior beliefs about the data with the information contained in the sample to estimate model parameters. Rather than a single point estimate of a parameter, the output of a Bayesian model is a "posterior" distribution which captures the uncertainty in our inferences. Bayesian methods are at the heart of many modern data science and machine learning techniques. In this introduction to Bayesian statistics we will cover conditional distributions, Bayes' theorem, basics of Bayesian modeling, conjugate priors, MCMC sampling, and application to real dataset(s) of interest to the student in R. If time permits, possible further directions include hypothesis testing, linear regression, hierarchical models, Latent Dirichlet Allocation, and the EM algorithm for missing data. The goal of this project is to come away with an understanding of the basic conceptual and technical aspects of Bayesian inference and to get our hands dirty with real and interesting data. Possible data applications include estimating (potentially waning) COVID vaccine efficacy, estimating COVID prevalence over time in Washington state, topic modeling in NLP, or any other dataset of interest to the student.
Erin Lipman: Bayesian perspectives on statistical modeling
Student: Zhengyang (Anthony) Xu
Prerequisites:
Some familiarity with multivariate linear regression will be helpful, as will some familiarity with R. Our project can be either more technical or more conceptual depending on the background and interests of the student.
Many of the methods we focus on in introductory statistics courses, for example confidence intervals and null hypothesis significance testing, come from the “Frequentist” philosophy of statistics. There is another, increasingly popular, philosophy of statistics called “Bayesian” statistics which has its own ways of conceptualizing and analyzing data. Bayesian statistics views parameters in the world (such as the effect of a medical treatment) as random variables rather than as fixed numbers, and it focuses on synthesizing prior evidence about the distribution of a parameter with information contained in the data. The goal of this project is to gain familiarity with statistical modeling from the Bayesian perspective.
Anna Neufeld: Introduction to Clinical Trials
Student: Hisham Bhatti
Prerequisites:
None.
Drawing mainly from the textbook "Fundamentals of Clinical Trials" by Friedman et al., we will learn about the design and analysis of clinical trials, with special attention to statistical considerations and the role of statisticians. Pending the interest of the student, for the final project we will either delve into an advanced statistical topic in clinical trials, or we will do a "case study" where we learn about a recent/current clinical trial in depth.
Sarah Teichman: Multivariate Data Analysis
Student: Huong Ngo
Prerequisites:
Stat 311, and linear algebra would be helpful but not necessary
Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook "An Introduction to Applied Multivariate Analysis with R" to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.
Seth Temple: Statistical Genetics I: Pedigrees and Relatedness
Student: Saleh Wehelie
Prerequisites:
STAT 311, and some programming experience
We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or entire genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for 1 or 2 meetings we will go through brief hands-on labs using current research software. More details on this recurring DRP may be found here: https://sdtemple.github.io/statgen1.
Drew Wise: Introduction to Nonparametric Statistics
Student: Xinyi (Vicky) Xiang
Prerequisites:
An introductory statistics class is all that's needed. Some programming experience would be a plus.
Many of the methods studied in an introductory statistics class — z-scores and t-tests, for example — rely on assumptions not always met by the data. The purpose of this project is to expose the student to nonparametric statistical tests, a class of techniques that are more broadly applicable. We will begin by discussing the advantages and disadvantages of nonparametric tests, and then we will study the tests themselves: Wilcoxon signed-rank tests, Mann-Whitney U-tests, and Kruskal-Wallis H-tests, among others. There is flexibility in the topics depending on student interest!
Autumn 2021
Nick Irons: Introduction to Bayesian Data Analysis
Student: Xuweiyi (William) Chen
Prerequisites:
Knowledge of expectations and probability distributions at the level of STAT 340-341 and some knowledge of R.
Bayesian statistics is a method of modeling data that synthesizes our prior beliefs about the data with the information contained in the sample to estimate model parameters. Rather than a single point estimate of a parameter, the output of a Bayesian model is a "posterior" distribution which captures the uncertainty in our inferences. Bayesian methods are at the heart of many modern data science and machine learning techniques. In this introduction to Bayesian statistics we will cover conditional distributions, Bayes' theorem, basics of Bayesian modeling, conjugate priors, MCMC sampling, and application to real dataset(s) of interest to the student in R. If time permits, possible further directions include hypothesis testing, linear regression, hierarchical models, and the EM algorithm for missing data. The goal of this project is to come away with an understanding of the basic conceptual and technical aspects of Bayesian inference and to get our hands dirty with real and interesting data. Possible data applications include estimating (potentially waning) COVID vaccine efficacy, estimating COVID prevalence over time in Washington state, or any other dataset of interest to the student.
Alex Ziyu Jiang: Clustering and music genre classification
Student: Yitong (Eva) Shan
Prerequisites: Knowledge of probability at the level of Stat 311 or beyond; some coding experience, preferably in Python or R. It will be fantastic if you also happen to like listening to music ;)
Have you ever been amazed by the sheer number of music genres in your Spotify or Apple Music app and wanted to understand their differences in a quantitative way? In this project you will learn how to process audio data and use some interesting clustering techniques from machine learning to classify songs into different genres.
David Marcano and Daniel Suen: Cluster Analysis
Students: Townson Cocke and Renee Chien
Prerequisites: Basic knowledge of R or Python; a statistical background equivalent to STAT 311 is recommended
In many real-world data applications, from medicine to finance, it is of interest to find groups within the data. Clustering is an unsupervised learning approach for separating data into representative groups. How to find and assess the quality of these discovered clusters is a vast area of modern research. In this project, we will survey several popular clustering techniques and utilize them in simulated and real datasets. In particular, we will explore center-based approaches such as the k-means algorithm, dissimilarity-based approaches such as hierarchical clustering, probability-based approaches such as mixture models, and other techniques based on student interest. We will also look at how to assess a given clustering. The topics covered and their depth will develop based on the interest and statistical/mathematical level of the student. We are happy to take two students if more than one person is interested in this project.
Anna Neufeld: Multiple Testing
Student: Cathy Qi
Prerequisites:
Stat 311 and some knowledge of R will be helpful, but not required.
In an introductory statistics course, you learn how to obtain a p-value to test a single null hypothesis. These p-values are constructed such that, when the null hypothesis is true, you will make a mistake and reject the null only 5% of the time. In the real world, scientists often wish to test thousands of null hypotheses at once. In this setting, making a mistake on 5% of the hypotheses could lead to a very high number of false discoveries. Multiple testing techniques aim to limit the number of mistakes made over a large set of hypotheses without sacrificing too much power. We will start with a review of hypothesis testing, then discuss the challenges posed by large numbers of hypotheses, and finally learn about modern multiple testing techniques. Towards the end of the quarter, we will apply the techniques we learned to real data.
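To make the problem concrete, here is a minimal Python sketch comparing three decision rules on the same set of p-values. The description above does not name specific techniques; Bonferroni and Benjamini-Hochberg are used here as two standard examples, and the p-values are hypothetical.

```python
def unadjusted(pvals, alpha=0.05):
    """Reject every hypothesis whose p-value is at most alpha."""
    return [p <= alpha for p in pvals]


def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m, which controls the family-wise error rate."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule, which controls the false discovery rate.

    Find the largest rank k with p_(k) <= k * alpha / m and reject the
    k smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values from ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
n_unadjusted = sum(unadjusted(pvals))
n_bonferroni = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(n_unadjusted, n_bonferroni, n_bh)
```

On these values the unadjusted rule rejects the most hypotheses, Bonferroni the fewest, and Benjamini-Hochberg falls in between, which illustrates the trade-off between limiting mistakes and retaining power.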
Michael Pearce: Voting, Ranking, and Preference Modeling
Student: Carolina Sawyer
Prerequisites:
Stat 311 or equivalent
Preference data appears in many forms: voters deciding between candidates in an election, movie critics rating new releases, and search engines ranking web pages, to name a few! However, modeling preferences in a statistical manner can be challenging for a variety of reasons, such as computational difficulties in working with discrete and high-dimensional data. In this project, we will study a variety of models used for preference data, which includes both ranking and scoring models. Understanding challenges and uncertainty in aggregating preferences will be a key focus. Together, we will also carry out an applied project on preference data based on the student's interests.
Seth Temple: Statistical Genetics I, Pedigrees and Relatedness
Student: Michael Yung
Prerequisites:
STAT 311; some programming experience preferred
We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or entire genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for 1 or 2 meetings we will go through brief hands-on labs using current research software. The final project may involve estimating familial relationships among individuals in the 1000 Genomes database and comparing outputs among various statistical software.
Vydhourie R.T. Thiyageswaran: Graph Clustering
Student: Dawei Wang
Prerequisites:
Introductory Linear Algebra (and interest in basic introductory graph theory would be helpful)
We will explore clustering methods in graphs, focusing on k-means clustering and spectral clustering. Additionally, we will spend some time looking at applications by thinking about studies explored in statistical blog entries, for example, in FiveThirtyEight. If there’s interest, we can look into replicating and extending some of the ideas in these studies.
Steven Wilkins-Reeves: An Introduction to Causal Inference and Sensitivity Analysis
Student: Hadi Nazirool Bin Yusri
Prerequisites:
Stat 311 (would be useful to have familiarity with linear regression)
Randomized controlled trials are often called the “gold standard” for assessing the effect of a treatment on an outcome. However, for many scientific questions, a randomized controlled trial may be either unethical (e.g. you can’t force someone to smoke to figure out whether it causes cancer) or downright impossible (e.g. you can’t assign someone a higher birth weight). Techniques from causal inference can help us to estimate these treatment effects using only observational data and some identifying assumptions. Sensitivity analysis can tell us how robust our conclusions are to violations of those assumptions. In this project, you will read parts of Causal Inference: A Primer by Judea Pearl, as well as some papers on the topic. A final project may involve analyzing an observational dataset of your choice applying causal inference and sensitivity analysis techniques.
Kenny Zhang: Basics of Causal Inference
Student: Qiguang Yan
Prerequisites:
STAT 311 level statistics; some familiarity with regression is a plus.
"Correlation is not causation" used to prevent statisticians from answering questions like "Will smoking cause lung cancer?". However, with the tools of causal inference and the emergence of big data, we are now able to answer some of these questions on a firm scientific basis. We can use causal inference to look at a variety of topics, including vaccination, genes, and more.
Spring 2021
Peter Gao: Ethics of Algorithmic Decision Making
Student: Kevin Hoang
Prerequisites:
None
In this project, we'll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We'll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we'll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there's interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.
Alex Ziyu Jiang: Sampling methods, Markov Chain Monte Carlo and Cryptography
Student: Kathleen Cayha
Prerequisites:
Basic knowledge of probability is recommended (STAT 311 level). Some prior coding experience with R would be great but is not necessary.
In this project we will learn how to decipher coded messages with the widely used Markov Chain Monte Carlo method. We will first go through the basics of Markov chain models after a quick probability warm-up. After that we will learn how to generate samples from a known distribution using a wide range of techniques. Finally, we will apply our tools to a dataset consisting of coded messages, and we will see how the 'messy' code gradually turns into complete sentences using what we have learned.
Alan Min and Anupreet Porwal: Expectations and Sampling methods
Students: Kai Gong and Aubrey Yan
Prerequisites:
STAT 340-341 and some knowledge of R; basics of expectations and probability distributions.
Expectations of random variables, or of functions of random variables, can be difficult to compute analytically when the probability distributions of those variables are not standard, well-known distributions. One way to approximate expectations is by “intelligently” drawing samples from the probability distributions. In this project, we will cover several sampling and Monte Carlo methods to draw samples from “difficult distributions” and use these samples to approximate expectations. In particular, we will look at transformation-based sampling, importance sampling, rejection sampling, and their popular variants. Finally, we will compare the performance of these sampling methods and apply the methodology to a dataset of interest to the student in a Bayesian analysis. We are happy to take two students if more than one person is interested in this project.
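As a rough illustration of two of the methods named above, the following Python sketch approximates expectations under a target density using rejection sampling and importance sampling. The target density 6x(1-x) on [0, 1] (a Beta(2, 2) density), the uniform proposal, and all function names are assumptions made for this example only.

```python
import random


def target_density(x):
    """Assumed target: the Beta(2, 2) density 6x(1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x)


def rejection_sample(n, rng):
    """Draw n samples from the target using a Uniform(0, 1) proposal.

    The density is bounded by 1.5 (its value at x = 0.5), so a proposed x
    is accepted with probability target_density(x) / 1.5.
    """
    bound = 1.5
    samples = []
    while len(samples) < n:
        x = rng.random()
        if rng.random() * bound <= target_density(x):
            samples.append(x)
    return samples


def importance_estimate(h, n, rng):
    """Estimate E[h(X)] under the target by weighting Uniform(0, 1) draws.

    The proposal density is 1 on [0, 1], so each weight is simply
    target_density(x).
    """
    total = 0.0
    for _ in range(n):
        x = rng.random()
        total += h(x) * target_density(x)
    return total / n


rng = random.Random(0)  # fixed seed so the run is reproducible
draws = rejection_sample(20000, rng)
mean_by_rejection = sum(draws) / len(draws)
second_moment_by_importance = importance_estimate(lambda x: x * x, 20000, rng)
print(mean_by_rejection, second_moment_by_importance)
```

For this target the exact values are E[X] = 0.5 and E[X^2] = 0.3, so both estimates can be checked against a known answer before moving on to distributions where no closed form is available.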
Anna Neufeld: Infectious disease modeling
Student: Kayla Kenyon
Slides ,
Writeup
Prerequisites:
Some comfort in R; experience with calculus and differential equations will be useful but not required.
We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
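For readers who have not seen an SIR model before, here is a minimal Python sketch (the project itself uses R). It integrates the standard SIR equations dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I with a simple Euler step, where S, I, and R are population proportions. The parameter values and the function name are hypothetical.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with forward Euler steps of size dt.

    beta is the transmission rate and gamma is the recovery rate.
    Returns a list of (time, S, I, R) tuples.
    """
    s, i, r = s0, i0, r0
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


# Hypothetical parameters: 1% of the population initially infected.
history = simulate_sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"Peak infected proportion {peak_infected:.3f} on day {peak_time:.1f}")
```

The output is fully determined by beta, gamma, and the initial conditions, which is why estimating those inputs from noisy outbreak data is the statistical part of the problem.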
Michael Pearce: Nonlinear Regression
Student: Muhammad Anas
Slides ,
Writeup
Prerequisites:
A basic knowledge of linear regression and some experience in R
Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
Taylor Okonek: Disease Mapping
Student: Wuwei Zhang
Slides ,
Writeup
Prerequisites:
STAT 340; Interest in public health applications; Familiarity with R
Disease mapping is an important tool for visualizing spatial data on the prevalence and/or incidence of various diseases. In this project, we’ll discuss different types of spatial data, and explore visualization techniques and their usefulness in conveying relevant information. In particular, we’ll discuss ways to visualize uncertainty in disease mapping, how estimates underlying a disease map inform public health policy, issues with data sparsity and spatial aggregation, and how to obtain the estimates that underlie such maps. We’ll learn about Bayesian hierarchical models and, time permitting, spatial random effect terms. Throughout, we’ll explore concepts using real data from various diseases that are of student interest.
Sarah Teichman: Ethics of Algorithmic Decision Making
Student: Liwen Peng
Slides ,
Writeup
Prerequisites:
None
In this project, we'll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We'll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we'll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there's interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.
Apara Venkat: Networks and Choice Modeling
Student: Andrey Risukhin
Slides ,
Writeup
Prerequisites:
Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary.
A general interest and curiosity about math and the world.
Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the "utility" of an item? How do you model the decisions of a customer when there are random effects?
If we have data from decisions made by customers, can we identify the
utility of various cereals? Discrete choice models attempt to answer these questions.
In the language of networks, this problem is closely related to ranking.
How does Google rank webpages? How do we rank players in sports tournaments?
Recent developments have been unifying these fields.
In this project, we will wrestle with these questions. First, we will learn about discrete choice models.
Then, we will learn about ranking in networks. Finally, we will attempt to reconcile the two.
Time permitting, we will also run simulations and look at datasets along the way.
Winter 2021
Peter Gao: Survey of Data Journalism
Student: Rohini Mettu
Slides ,
Writeup
Prerequisites:
None
In this project, we'll take a look at recent uses of data and statistics in journalism and discuss their effectiveness in applying statistical methods and communicating results to their readers. If desired, we can focus on a specific area of application (climate change, economics, sports, epidemiology). If there is interest, we can look into replicating and extending a particular example.
Richard Guo: Making probability rigorous
Student: Mark Lamin
Prerequisites:
Probability theory at the level of MATH/STAT 394 and 395
Having sat through the introductory probability course, likely you have heard things like "Lebesgue measure", "sigma algebra", "almost sure convergence" and even "martingale" being mentioned. Do you wonder what they are and why they matter? This is a reading program that will introduce these notions and make the probability you learned *rigorous*. We will read together the acclaimed monograph "Probability with Martingales" by David Williams. Rigorous treatment of probability and measure theory will prepare you for more advanced topics, such as stochastic processes, learning theory and theoretical statistics.
Bryan Martin: Statistical Learning with Sparsity
Student: Jerry Su
Prerequisites:
Familiarity with regression, up to a STAT 311 level
Many modern applications benefit from the principle of less is more. Whether due to practical computation concerns from big data, overfitting concerns from too many parameters, or estimability concerns from a small sample size, statistical models often require sparsity. Sparsity can improve our predictions, help make the patterns we observe in our data more reproducible, and give our model parameters desirable properties.
Often, sparsity is imposed through penalization, where we include a term in our model to enforce that some parameters are set equal to zero. We will learn about some of the statistical theory underlying how penalization works, and how it impacts our model output, both mathematically and computationally. We will also learn about and compare different sparsity schemes, such as lasso, group lasso, elastic net, and more. We will focus on understanding the different settings in which we might be interested in different forms of sparsity and apply these tools to real data.
Eric Morenz and Yiqun Chen: See what's not there
Student: Suh Young Choi
Slides ,
Writeup
Prerequisites:
Experience with linear regression, probability, or data manipulation will allow a deeper dive into the content,
but it is not required for students who are interested in the subject.
In this project, we will take a look at the concept of identification in the context of missing data (and causal inference, if time permits or there is interest!).
While no glamorous artificial intelligence buzzwords are involved in the project per se,
remember that your model is only as good as your data (and, as we will see by the end of the
quarter, only as good as your identification assumptions!). We will be drawing from various sources
(e.g., Chapter 6 in Foundations of Agnostic Statistics) in the hope of keeping the
schedule and materials flexible given your background and interests. We will consider a
few empirical problems as well, from TidyTuesday data sets to political polls.
Taylor Okonek: Topics in Biostatistics
Student: Anna Elias-Warren
Slides ,
Writeup
Prerequisites:
Introductory statistics a plus but not required, interest in public health applications
In this project, we’ll first broadly discuss some of the main pillars of the field of biostatistics, and then focus on a more specific topic for the remainder of the quarter. The main pillars we'll discuss include design of clinical trials, survival analysis, and infectious disease modeling. The focused part of this project can be greatly driven by student interest. Possible directions include: doing a deep-dive into the design of COVID-19 vaccine trials; reading articles about and discussing implications of communicating public health analyses to the public; gaining a broad understanding of how infectious disease models have influenced policy throughout the world; reading about ethical issues in global health studies; and more!
Michael Pearce: Nonlinear Regression
Student: Alejandro Gonzalez
Slides ,
Writeup
Prerequisites:
A basic knowledge of linear regression and some experience in R
Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
Sarah Teichman: Multivariate Data Analysis
Student: Lindsey Gao
Slides ,
Writeup
Prerequisites:
Stat 311, and linear algebra would be helpful but not necessary
Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook "An Introduction to Applied Multivariate Analysis with R" to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.
Seth Temple: Statistical Genetics and Identity by Descent
Student: Selma Chihab
Slides ,
Writeup
Prerequisites:
STAT 311; some programming experience preferred
We will explore statistical theory and methodology as it applies to the study of (human) heredity. The overarching themes of the readings are (1) to compute measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees and (2) to estimate relatedness given dense SNP or entire genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for 1 or 2 meetings we will go through brief hands-on labs using current research software.
Apara Venkat: Networks and Choice Modeling
Student: Xuling Yang
Slides ,
Writeup
Prerequisites:
Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary.
A general interest and curiosity about math and the world.
Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the "utility" of an item? How do you model the decisions of a customer when there are random effects?
If we have data from decisions made by customers, can we identify the
utility of various cereals? Discrete choice models attempt to answer these questions.
In the language of networks, this problem is closely related to ranking.
How does Google rank webpages? How do we rank players in sports tournaments?
Recent developments have been unifying these fields.
In this project, we will wrestle with these questions. First, we will learn about discrete choice models.
Then, we will learn about ranking in networks. Finally, we will attempt to reconcile the two.
Time permitting, we will also run simulations and look at datasets along the way.
Jerry Wei: Topological Data Analysis
Student: Joia Zhang
Slides ,
Writeup
Prerequisites:
Exposure to probability theory and linear algebra
Topological Data Analysis (TDA) broadly is about data analysis methods that find structure in data. This includes a lot of topics, and we will focus on clustering and nonlinear dimension reduction. We will study some textbook chapters and some classical papers.
We may also go into mode estimation and manifold estimation if there is interest.
Kenny Zhang: Deep Learning for Computer Vision
Student: Angela Zhao
Slides ,
Writeup
Prerequisites:
Proficiency in a programming language (preferably Python). Some exposure to basic probability rules and computer science would be helpful.
Image data is prevalent in our lives, and modern deep learning provides a powerful
tool for dealing with it. We will start with logistic regression and study what a
neural network is. Then we will move on to convolutional neural networks and some coding exercises.
If time allows, we can delve into state-of-the-art Generative Adversarial Networks
(GANs) and more complicated tasks like segmentation.
Autumn 2020
Peter Gao: Statistics for Data Journalism: Election Forecasting
Student: Andy Qin
Slides
Prerequisites:
Experience with introductory stats (at the level of any of the intro classes)
would help.
In this project, we'll take a look at how leading
newspapers and researchers conduct polls,
forecast elections, and calculate polling averages. If there is interest, we can work on reverse
engineering some of the methods used by publications such as FiveThirtyEight, RealClearPolitics, and the Upshot.
Finally, we'll consider the ethics of forecasting elections and using statistics in general to study our election process.
Zhaoqi Li: Statistical Illusions
Student: Yeji Sohn
Slides
Prerequisites: Motivation to think about interesting problems and readiness for the brain to be teased. Some mathematical maturity would be beneficial.
Do you know that there is a “statistically significant” relationship between your salary and if you pee at night? Do you know that you will always wait longer than others at a bus stop? Do you know that a lot of the statistical concepts you learned in class actually don’t make sense? In this quarter-long study, we will dive into some common misconceptions about statistics and the questions of how to interpret statistics. We will touch on a wide range of statistical topics from a paradoxical view and learn the intuition behind them.
No prior knowledge of statistics is required but motivation is encouraged.
Shane Lubold: Random Network Models
Student: Peter Liu
Slides
Prerequisites: Intro statistics and some programming experience (R or Python).
Network data, which consists of edges or relationships between nodes, is an important type of data. Many statistical models have been proposed to understand and model this type of data. Some are simple models which assume that all actors form connections with the same probability, while others are more complicated and use node-specific characteristics to determine the probability of an edge. In this project we will first review common network models (such as the Erdős-Rényi model, stochastic block model, and latent space model) and discuss why they might be useful in practice. We will then fit these models to data sets from the Stanford Network Analysis Project and discuss why some models fit better than others. The goal of the project is to understand how network data can arise in the real world and how network properties determine which models are reasonable. If we have time, we can also look at dynamic networks (networks that change over time) and see if we can model them using any of the models discussed above.
Bryan Martin: Ethics in Data Science and Statistics
Student: Jinghua Sun
Slides
Prerequisites:
None
In this project, we will discuss ethical questions and issues that arise in the field of statistics and data science. We will read case studies and work together to develop a lesson that can be taught to introductory statistics students as part of an undergraduate curriculum. By the end of this project, I hope to have material that I will use in my own courses! Topics will be driven by the student's particular interest, but possible topics will include: the history of statistics and eugenics, race and gender in data science, algorithmic fairness, reproducibility and open science, data transparency, privacy, and human subjects data.
Ronak Mehta: The Magical Properties of the SVD
Student: Claire Gao
Prerequisites:
Linear Algebra (Math 308 or equivalent). Some statistical background, preferably at
the level of 340.
The singular value decomposition (SVD) of a matrix has wide relevance in virtually all areas of applied mathematics. This project will consist of three parts:
1) a theory section containing proofs of intriguing properties about the SVD and derivations of three problems of wide importance in statistics and machine learning: principal components analysis (PCA), partial least squares (PLS), and canonical correlation analysis (CCA), all of whose solutions depend heavily on the SVD.
2) a simulation section demonstrating the bias-variance tradeoff of using one method over another for various regression/classification tasks.
3) a real data section in which the student interprets the features learned by these methods on a dataset of their choice.
This project is ideal for intermediate statistics students who want to make their linear algebra skills airtight and have a strong mathematical foundation for future success in machine learning.
Anna Neufeld: Infectious Disease Modeling
Student: Harper Zhu
Slides ,
Shiny App
Prerequisites:
Some comfort in R; experience with calculus and differential equations will be useful but not required.
We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
Michael Pearce: History and Practice of Data Communication
Student: Ziyi Li
Writeup
Prerequisites:
None; some experience with R or Python may be helpful but is not required.
In this course, we'll learn about the development of data communication techniques and their modern use. We'll begin by studying how people have visualized patterns in data over time, and consider how those methods reflected the computational resources available in each era. Then, we'll shift our attention to modern issues in data communication, drawing examples from the COVID-19 pandemic and 2020 US presidential election: How do practitioners effectively show complex relationships or model uncertainty? How do people mislead readers through text and figures (intentionally or otherwise)? What common pitfalls exist, and how can we avoid them? We'll finish with a data communication project based on the student's interests.
Subodh Selukar: Introduction to Survival Analysis
Student: Howard Baek
Writeup ,
Shiny App
Prerequisites:
Familiarity with R; familiarity with survival analysis
In many applications, researchers are interested in the time it
takes for an outcome of interest to occur: for example, time to death by
any cause ("overall survival") is the gold standard outcome for studies in many
biomedical fields. Among other characteristics, these data exhibit a special kind of
missingness termed "censoring," which requires the use of different statistical methods
than other data types. In this project, the student will learn about the characteristics of
time-to-event (or "survival") data and basic methods for approaching these data.
Sarah Teichman: Phylogenetic Trees
Student: Lexi Xia
Writeup
Prerequisites:
An intro stats class. Some R experience is useful but not required.
In this project, we will learn about the application of statistics to evolutionary biology through working with phylogenetic trees. In evolutionary biology, a diagram in the form of a tree is often used to represent the diversification of species over time. In this project, we'll read chapters from the book "Tree Thinking" and choose a dataset to investigate deeply in R in order to understand phylogenies: what they are, how they are used, and how we can use statistics to construct them and test hypotheses about evolution.
Seth Temple: Statistical Genetics and Identity by Descent
Student: Rachel Ferina
Slides Writeup
Prerequisites:
None; keen interest in the biological sciences
We will explore many classical ways in which statistics has been employed to study heredity in humans and other organisms. For example, we will introduce the expectation-maximization algorithm to infer allele frequencies for ABO blood types and discuss Jacquard’s 9 condensed states of identity by descent. This tutorial will be very practical as we will draw many family trees to compute kinship and inbreeding coefficients. We will use UW emeritus professor Elizabeth Thompson’s monograph "Statistical Inference from Genetic Data on Pedigrees" as reading material. Depending on student interest, we may read more chapters from Thompson’s book, investigate the history of statistical genetics as it relates to the eugenics movement, or code up some computations like the path counting formula.
Spring 2020
We ran a limited number of projects due to COVID-19.
Sheridan Grant: Causal Inference: DAGs and Potential Outcomes
Student: Grace Shen
Slides
Prerequisites: Familiarity with linear regression and facility with
Gaussian distributions (preferably multivariate)
This project will focus on reading rather than data analysis.
It's intended for a junior or senior student who is interested in learning about causal inference, a
huge topic in graduate-level statistics and stats research, perhaps as a prelude
to applying for PhDs. You'll read classic papers and parts of textbooks on two approaches
to causal inference: potential outcomes and graphs. For the final presentation,
you'll contrast the two approaches as applied to a problem (practical or theoretical) of your choice.
Shane Lubold: Random Graphs
Student: Gordon An
Prerequisites: Some exposure to probability. Some exposure to, or an interest in, graph theory.
In this project, we will study random graph theory and how the behavior of these graphs changes as the size of the graph grows. We will focus primarily on a simple graph model with a number of interesting properties, the Erdős-Rényi model. In this simple model, we generate a graph on n nodes, where each node connects to any other node with probability p(n), which can depend on the graph size n. We will use theory and simulations to derive key properties of this model, such as the distribution of the degree of a vertex or the number of cliques of any size. We will also explore other exciting properties of this model. For example, if the product p*n grows at a certain rate as n gets big, then the graph will exhibit large cliques with probability approaching 1. The proofs of these ideas use only basic statistical ideas. We will prove the conditions that guarantee this behavior and use simulations to explore how large the graphs must be to see this behavior. This project will expose students to the exciting field of random graphs and will give them a good understanding of how simple statistical tools can answer complex questions.
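As a rough illustration of the kind of simulation described above, the sketch below samples one graph from the model and compares the average degree with the theoretical value (n - 1) * p. The function name sample_er_degrees and the values of n, p, and the seed are assumptions chosen for this example, not part of the project materials.

```python
import random


def sample_er_degrees(n, p, seed=None):
    """Sample one Erdős-Rényi graph on n nodes and return each node's degree."""
    rng = random.Random(seed)
    degrees = [0] * n
    # Each unordered pair {i, j} is connected independently with probability p.
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degrees[i] += 1
                degrees[j] += 1
    return degrees


# Illustrative values only (assumed for this sketch).
n, p = 200, 0.05
degrees = sample_er_degrees(n, p, seed=1)

mean_degree = sum(degrees) / n
# Each node has n - 1 possible neighbors, so its degree is Binomial(n - 1, p).
expected_degree = (n - 1) * p
print(f"simulated mean degree: {mean_degree:.2f}, theory: {expected_degree:.2f}")
```

Repeating this for increasing n, with p allowed to depend on n, is one simple way to see how large a graph must be before the limiting behavior becomes visible.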
Anna Neufeld: Disease Modeling
Student: Rachael Ren
Writeup, Slides
Prerequisites: Knowledge of R will be useful!
We will start by reading introductory material on SIR compartmental models for disease
modeling, and will work to implement these models in R. These are deterministic differential equation models
whose output depends on knowledge of various input parameters. After becoming comfortable with the models,
we will discuss how statisticians estimate the parameters of these
models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting.
The project will evolve based on the interests of the student and relevant current events.
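For readers who have not seen a compartmental model before, here is a minimal sketch of the standard SIR equations, integrated with a simple Euler step. It is written in Python rather than R purely for illustration; the function name simulate_sir and the parameter values (beta, gamma, population sizes) are assumptions picked for the example, not estimates from any real outbreak.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with Euler steps; return one (S, I, R) per day.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    steps_per_day = round(1 / dt)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical inputs: transmission rate 0.3/day, recovery rate 0.1/day.
history = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"infections peak on day {peak_day} at about {history[peak_day][1]:.0f} people")
```

The output is fully determined by beta, gamma, and the starting counts, which is why estimating those parameters from outbreak data is the statistical part of the problem.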
Winter 2020
Peter Gao: Introduction to Gaussian Processes
Student: Hannah Chang
Prerequisites: None; interest in programming encouraged
As a concept, the Gaussian distribution, often referred to as the normal distribution or the bell curve, has cemented itself in the public consciousness. But what about its finite-dimensional generalization, the multivariate Gaussian? Or its infinite-dimensional counterpart, the Gaussian process? This project has two main aims: first, to discuss and explore how Gaussian processes arise in various subfields like machine learning and spatial statistics; and second, to develop notes (or a website) that explain Gaussian processes to a general audience. Of course, the exact focus of this project is flexible, based on the reader's interests and background.
Kristof Glauninger: Nonparametric Regression
Student: Eli Grosman
Writeup
Prerequisites: Familiarity with linear regression and basic probability, comfort with algebra, some calculus
Nonparametric statistical methods
have seen an explosion in popularity as datasets have increased in size and complexity.
The goal of this project will be to introduce students who are familiar with parametric regression
models to a nonparametric setting. We will explore some of the basic theory and applications of these models,
as well as an interesting case where we can achieve parametric convergence rates in a nonparametric setting.
Zhaoqi Li: Statistical Machine Learning and Data Analysis
Student: Zhijun Peng
Writeup
Prerequisites: Knowledge of probability theory and Maximum Likelihood Estimation at the level of Stat 340 is preferred;
some familiarity with basic programming is preferred;
an enthusiasm for reading and experimenting is encouraged.
We will discuss the relationship between statistics and machine learning, one of the most popular fields in the world,
and how statistical techniques could be used in the machine learning framework.
Topics may include classifiers (e.g., Decision Tree, Naive Bayes),
training (what is training and the relation to likelihood inference), etc.
The design could range from experimental to theoretical,
depending on the background of the student.
Shane Lubold: Random Graphs
Student: Tahmin Talukder
Prerequisites: Some exposure to probability. Some exposure to, or an interest in, graph theory.
In this project, we will study random graph theory and how the behavior of these graphs changes as the size of the graph grows. We will focus primarily on a simple graph model with a number of interesting properties, the Erdős-Rényi model. In this simple model, we generate a graph on n nodes, where each node connects to any other node with probability p(n), which can depend on the graph size n. We will use theory and simulations to derive key properties of this model, such as the distribution of the degree of a vertex or the number of cliques of any size. We will also explore other exciting properties of this model. For example, if the product p*n grows at a certain rate as n gets big, then the graph will exhibit large cliques with probability approaching 1. The proofs of these ideas use only basic statistical ideas. We will prove the conditions that guarantee this behavior and use simulations to explore how large the graphs must be to see this behavior. This project will expose students to the exciting field of random graphs and will give them a good understanding of how simple statistical tools can answer complex questions.
Bryan Martin: R Package Development
Student: Thomas Serrano
Writeup
Prerequisites: Familiarity with R
Reproducible statistical analysis depends on good software and coding practices. In this project, we will learn how to go from users of R packages to developers of R packages. We will also practice and implement general software developer skills, including documentation, version control, and unit testing. We will learn how to make our code robust, efficient, and user-friendly. Ideally, you will start with an idea of something you are interested in implementing as an R package, whether it be a statistical model, data analysis application, or anything else, though this is not required!
Anna Neufeld: Statistical Natural Language Processing
Student: Christina Nick
Prerequisites: Proficiency in a programming language. Knowledge of basic
probability rules at the level of Stat 311.
In most statistics classes, the data you work with are numbers. Text documents
such as books, articles, and speeches provide massive sources of data that cannot be analyzed using the tools from your introductory statistics
courses. We will explore the field of statistical natural language processing and discuss classification and clustering techniques for text data.
Applications of such techniques include translation, information retrieval, fake news detection, and sentiment analysis.
After reviewing the literature to get a sense of the general techniques in NLP, we will select a particular
text dataset and research question and work on an applied project.
Michael Pearce: Nonlinear Regression
Student: Oliver Bejar Tjalve
Writeup
Prerequisites: A basic knowledge of linear regression and some experience in R
Simple linear regression models can be easy to implement and interpret,
but they don't always fit data well!
For this project, we'll explore regression methods that relax the assumption of linearity.
These might include (based on the interest and/or experience level of the student)
polynomial regression, step functions, regression splines, smoothing splines,
multivariate adaptive regression splines, and generalized additive models.
Hopefully, we'll even see how to validate such models using cross-validation!
We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
Anupreet Porwal: Bayesian Linear Regression and Applications
Student: Yuchen Sun
Writeup
Prerequisites: Basic knowledge of probability distributions at the level of
Stat 394 or Stat 340. Knowledge of Linear Algebra is essential for this project. Familiarity with a programming language may be helpful.
Often when we fit models in practical applications, we have some prior understanding of the context of the problem or field that could be used, along with the data, to tune our model. For example, if you are trying to model the reply times of emails from a department chair to professors, the rank of each professor (full or assistant) can be helpful information.
Bayesian statistics provides a formal way to incorporate our prior beliefs and information into the model and is particularly useful because it helps to quantify the uncertainty in our inferences. In this project, we will discuss the basics of Bayes' theorem and the Bayesian version of linear regression. If time permits, we will learn about probabilistic matrix factorization (recommendation systems) and apply these techniques to an interesting problem.
Sarah Teichman: Networks
Student: Josiah Thulin
Writeup
Prerequisites: Stat 311. Some R is useful
but not required.
Most of the data that you see in STAT 311 are assumed to be independent.
However, a lot of interesting datasets include information about individual observations
and the relationships between them. This type of data can be analyzed as networks, in which nodes
represent individuals and edges represent relationships between them.
Networks can be used to study interactions between social groups,
the spread of contagious diseases, biological cycles, etc.
We will use the textbook "Statistical Analysis of Network Data," along with its companion text
"Statistical Analysis of Network Data in R" by Eric D. Kolaczyk. We will additionally read one or
two papers about an application of network analysis and/or analyze a small
network in R (based on interest of the student).