Autumn 2022 Projects


Antonio Olivas and Anand Hemmady: Introduction to Survival Analysis

Student: Bao Han Ngo
Slides | Writeup
Student: Nathan Dennis
Slides | Writeup
Prerequisites: Familiarity with basic probability theory (random variables, distribution functions, expectation)

How can we understand and estimate the length of time that will elapse before some outcome of interest happens? This question is important for a wide range of applications, including (but certainly not limited to) problems in medicine and public health. To answer it, we use tools from survival analysis. Analyzing survival data comes with a unique set of challenges that distinguish survival analysis from other fields of statistics. The most notable of these is that survival data are often censored, meaning that for some observations we cannot see whether or when the event happened. We will first look at the kinds of problems survival analysis can address, with particular attention to those involving censoring. We will then explore both parametric (e.g., maximum likelihood) and nonparametric (e.g., Kaplan-Meier) methods for handling these problems, contrasting the two approaches and learning the advantages and disadvantages of each. Depending on student interest, we may also discuss the Cox regression model. We also plan to see how to compare survival curves with parametric models and the log-rank test, and finally we will apply what we have learned to a particular problem (to be chosen in conjunction with the student).
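The Kaplan-Meier estimator mentioned above can be sketched in a few lines of Python. This is an illustrative implementation (not drawn from the project materials): at each observed event time it multiplies the running survival estimate by the fraction of at-risk subjects who did not experience the event, which is exactly how censored observations contribute without being counted as events.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function S(t).

    times:  observed follow-up times
    events: 1 if the event was observed, 0 if the observation was censored
    Returns a list of (event time, S(t)) pairs.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv = 1.0
    curve = []
    for t in np.unique(times[events == 1]):        # distinct event times, in order
        at_risk = np.sum(times >= t)               # still under observation at t
        d = np.sum((times == t) & (events == 1))   # events occurring at t
        surv *= 1.0 - d / at_risk                  # KM product-limit update
        curve.append((t, surv))
    return curve

# Toy data: times 1, 2, 2, 3, 4 where the second "2" and the "4" are censored.
print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]))
```

With these five observations the estimate drops to 0.8 after the event at time 1, to 0.6 after time 2, and to 0.3 after time 3; the censored observations only shrink the risk sets.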



Apara Venkat: Introduction to Causal Discovery

Student: Mandy Zhang
Slides | Writeup
Prerequisites: Knowledge about probability distributions, conditional independence. Programming experience would be nice, but not required.

In this project, we will take a graphical approach to learning causal relationships between the variables in a system. First, we will learn how to represent causality using Directed Acyclic Graphs (DAGs). We will cover concepts such as d-separation, the Markov property, and faithfulness. We will then describe two algorithms for learning causal structure from observational data. The first is a constraint-based algorithm called PC (named after Peter Spirtes and Clark Glymour, who first described it). The second is a score-based algorithm called Greedy Equivalence Search (GES). We will then apply these methods to a real dataset. If time permits, we can explore other ideas such as computational complexity, causal sufficiency, and background knowledge.
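The d-separation concept underlying the PC algorithm can be checked mechanically via the moralization criterion: restrict the DAG to the ancestors of the variables involved, "marry" co-parents and drop edge directions, delete the conditioning set, and test connectivity. The sketch below is illustrative only (the dict-of-parents encoding is a hypothetical toy representation, not any library's API):

```python
from itertools import combinations

def d_separated(parents, x, y, z):
    """True iff x and y are d-separated given the set z in a DAG.

    parents: dict mapping each node to a list of its parent nodes
             (toy encoding; nodes with no parents may be omitted).
    """
    z = set(z)
    # 1. Keep only ancestors of {x, y} union z.
    relevant, stack = set(), [x, y, *z]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, []))
    # 2. Moralize: undirected parent-child edges plus edges between co-parents.
    adj = {v: set() for v in relevant}
    for child in relevant:
        ps = [p for p in parents.get(child, []) if p in relevant]
        for p in ps:
            adj[p].add(child); adj[child].add(p)
        for a, b in combinations(ps, 2):
            adj[a].add(b); adj[b].add(a)
    # 3. Delete z, then look for any path from x to y.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False           # connected, so NOT d-separated
        if node in seen or node in z:
            continue
        seen.add(node)
        stack.extend(adj[node] - seen)
    return True

collider = {'C': ['A', 'B']}       # A -> C <- B
print(d_separated(collider, 'A', 'B', []))      # True: marginally independent
print(d_separated(collider, 'A', 'B', ['C']))   # False: conditioning on a collider opens the path
```

This reproduces the classic collider behavior that the PC algorithm exploits when orienting edges.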



Ellen Graham: Practice and Philosophy of Data Cleaning

Student: Joy Li
Slides | Writeup
Prerequisites: Basic experience with coding is a plus but not necessary

When doing applied statistics it is often necessary to “clean” data before analyzing them, but the details of data cleaning are often glossed over. However, the choices made during cleaning can significantly impact the questions that the cleaned data can answer. In this project, we’ll discuss what it means to “clean” data and prepare them for the next stage of analysis. The project will vary based on student interest, but possible topics include: frameworks and tools used in practice, the ethics of data cleaning, common data structures, scaling tools to large datasets, missing data, and statistical considerations of choices made while cleaning.



Erin Lipman: Bayesian perspectives on probability and statistics

Student: Jennie Jeon
Slides | Writeup
Prerequisites: Probability at the level of 311, and some programming experience (preferably R)

Many of the methods we focus on in introductory statistics courses, for example confidence intervals and null hypothesis significance testing, come from the “Frequentist” philosophy of statistics, which interprets probability as describing the relative frequency of a certain event over repeated trials (ex. if I flip a fair coin 100 times, about 50 of these flips will land on heads). “Bayesian” statistics, on the other hand, interprets probability as describing our belief and uncertainty about an event (ex. if I flip a coin once, it is equally likely to come up heads or tails). Because the Bayesian perspective views probability in terms of belief, it provides a rigorous framework for updating our belief in light of new data (ex. if I see that my coin lands on heads 100 out of 100 times, I might start to suspect that it is a fake coin where both sides are heads). In this DRP, we will learn how the Bayesian framework allows us to update our beliefs in light of new data and allows us to answer questions that we cannot answer within the frequentist perspective.
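The coin example above is the standard conjugate Beta-Binomial update: a Beta(a, b) prior on the heads probability combined with observed flips yields a Beta(a + heads, b + tails) posterior. A minimal sketch (the uniform prior and the 100-flip data are illustrative numbers echoing the text, not project code):

```python
def update_beta(prior_a, prior_b, heads, tails):
    """Conjugate Beta-Binomial update: Beta(a, b) prior plus observed
    coin flips gives a Beta(a + heads, b + tails) posterior."""
    return prior_a + heads, prior_b + tails

def beta_mean(a, b):
    """Posterior mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Start agnostic: a uniform Beta(1, 1) prior, with mean 0.5.
a, b = 1.0, 1.0
# Observe the suspicious coin from the text: 100 heads in 100 flips.
a, b = update_beta(a, b, heads=100, tails=0)
print(beta_mean(a, b))  # posterior mean 101/102, about 0.990
```

The posterior mean shifting from 0.5 to roughly 0.99 is exactly the "start to suspect a fake coin" update described in the paragraph above.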



Vydhourie R T Thiyageswaran: Stellaris Project

Student: Gaunyi (Victor) He
Slides | Writeup
Prerequisites: Comfortable/strong programming skills in Python. An interest in games and networks could be useful.

The project will mainly be coding to simulate the process of players in a game on a graph. Here is a more detailed description of the project: https://www.stat.berkeley.edu/~aldous/Research/Stel_project/stellaris_project.html



Yikun Zhang: Introduction to Density-based Clustering and its Applications

Student: Dongfeng Li
Slides | Writeup
Prerequisites: STAT 311 or STAT 340 or equivalent (knowledge of basic probability and statistics), some familiarity with programming in Python or R, etc.

In many, if not most, practical applications, the available observations do not spread evenly over the data space but are instead grouped into several clusters. This project is designed to investigate how to statistically uncover these clusters from observational (point cloud) data through density-based approaches. Such approaches, unlike hierarchical clustering and other dissimilarity-based methods, leverage the (estimated) density of the data to define the clusters and do not require any dissimilarity metric in the clustering process. Among the family of density-based clustering approaches, we are planning to focus on mode clustering, during which the kernel density estimator and the mean shift algorithm will be reviewed and discussed. Theoretically, we may study the consistency of mode clustering and its connection to the EM algorithm. Practically, we may apply mode clustering to real-world data and present some interesting scientific analyses. Depending on the student’s interest, the project can be either theory-oriented or coding-focused. We are also happy to survey more density-based clustering approaches such as DBSCAN, or other clustering methods beyond the density-based domain, according to the student’s additional requests.
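The mean shift idea at the core of mode clustering can be sketched in one dimension: repeatedly move each point to the kernel-weighted average of the data around it, so points climb the kernel density estimate and converge to its modes; points that share a mode share a cluster. This is an illustrative toy (the bandwidth and data are arbitrary choices, not from the project):

```python
import numpy as np

def mean_shift_1d(data, bandwidth=0.5, steps=50):
    """Toy 1-D mean shift with a Gaussian kernel.

    Each point is repeatedly replaced by the kernel-weighted mean of the
    data near it, so it ascends the kernel density estimate toward a mode.
    Returns the (approximately) converged positions.
    """
    data = np.asarray(data, dtype=float)
    x = data.copy()
    for _ in range(steps):
        # Gaussian kernel weight of every data point, for every current position.
        w = np.exp(-0.5 * ((x[:, None] - data[None, :]) / bandwidth) ** 2)
        x = (w @ data) / w.sum(axis=1)   # weighted mean = mean-shift update
    return x

# Two well-separated groups: points converge to a mode near 0.1 and one near 5.1.
modes = mean_shift_1d([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
print(modes)
```

Grouping points whose converged positions coincide (up to a tolerance) recovers the two clusters without any dissimilarity metric, which is the contrast with hierarchical clustering drawn above.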



Zhaoqi Li: Introduction to Adaptive Experimental Design

Student: Zilin Huang
Slides | Writeup
Prerequisites: Either some mathematical maturity at the level of STAT 394, or some familiarity with Python.

Suppose you are in Vegas facing three lottery machines, each with a different probability of winning a prize. You would like to figure out which one wins the most, so you try out these machines. After trying many times, you start thinking about strategies: should I find the lottery machine that has the highest probability of winning the prize and keep playing that machine, or should I find the best way to play so that I lose the least amount of money over 100 rounds? Surprisingly, these two strategies lead to different answers, and they give rise to two branches of the multi-armed bandit literature. This field has close ties to applications at large tech companies like Amazon, Google, and Meta, and connects statistics, computer science, and economics. In this project, we will first review some well-known approaches in multi-armed bandits, and then, depending on the student’s background, either give a broad overview of the latest approaches for adaptive experimental design or conduct some experiments to visualize the power of these methods.
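One of the simplest well-known bandit strategies is epsilon-greedy: usually play the machine with the best observed win rate, but occasionally explore a random one. The sketch below is a hypothetical toy simulation (the arm probabilities, epsilon, and round count are invented for illustration), not any specific algorithm from the project:

```python
import random

def epsilon_greedy(probs, rounds=100, eps=0.1, seed=0):
    """Toy epsilon-greedy play on Bernoulli lottery machines.

    probs: true (unknown to the player) win probability of each machine.
    With probability eps we explore a random machine; otherwise we exploit
    the machine with the highest empirical win rate so far.
    Returns (pulls per machine, total prizes won).
    """
    rng = random.Random(seed)
    counts = [0] * len(probs)   # times each machine was played
    wins = [0] * len(probs)     # prizes won on each machine
    total = 0
    for _ in range(rounds):
        if rng.random() < eps or all(c == 0 for c in counts):
            arm = rng.randrange(len(probs))                  # explore
        else:
            arm = max(range(len(probs)),                     # exploit best so far
                      key=lambda i: wins[i] / counts[i] if counts[i] else 0.0)
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        total += reward
    return counts, total

counts, total = epsilon_greedy([0.2, 0.5, 0.8], rounds=100)
print(counts, total)
```

Varying eps trades off the two goals described above: a large eps learns the best machine faster (identification), while a small eps loses less money along the way (regret).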