Spring 2021

Peter Gao: Ethics of Algorithmic Decision Making

Student: Kevin Hoang
Slides, Writeup

  • Prerequisites: None
  • In this project, we'll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We'll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we'll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there's interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.
  • Alex Ziyu Jiang: Sampling methods, Markov Chain Monte Carlo and Cryptography

    Student: Kathleen Cayha
    Slides, Writeup

  • Prerequisites: Basic knowledge of probability is recommended (STAT 311 level). Some prior coding experience with R would be great but is not necessary.
  • In this project, we will learn how to decipher coded messages with the widely used Markov chain Monte Carlo (MCMC) method. After a quick probability warm-up, we will go through the basics of Markov chain models. We will then learn how to generate samples from a known distribution using a wide range of techniques. Finally, we will apply our tools to a dataset of coded messages and watch the 'messy' text gradually iterate into complete sentences using what we have learned.
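As a taste of the warm-up material, here is a minimal Python sketch (the two-state transition matrix is invented for illustration, and the project itself need not use Python): a simulated Markov chain's long-run state frequencies settle at its stationary distribution.

```python
import random

# Two-state Markov chain (toy example): entry i of P gives the probabilities
# of moving from state i to states 0 and 1.
P = {0: [0.9, 0.1],
     1: [0.5, 0.5]}

def state_frequencies(steps, seed=0):
    rng = random.Random(seed)
    state, visits = 0, [0, 0]
    for _ in range(steps):
        state = 0 if rng.random() < P[state][0] else 1
        visits[state] += 1
    return [v / steps for v in visits]

freqs = state_frequencies(100_000)
# Solving pi = pi P gives the stationary distribution pi = (5/6, 1/6),
# which the simulated frequencies approach as the chain runs longer.
```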
  • Alan Min and Anupreet Porwal: Expectations and Sampling methods

    Students: Kai Gong and Aubrey Yan
    Kai Slides, Kai Writeup
    Aubrey Slides, Aubrey Writeup

  • Prerequisites: STAT 340-341 and some knowledge of R; basics of expectations and probability distributions.
  • The expectation of a random variable, or of functions of random variables, can be difficult to compute analytically when the underlying probability distributions are not standard, well-known distributions. One way to approximate expectations is to “intelligently” draw samples from the probability distributions. In this project, we will cover several sampling and Monte Carlo methods for drawing samples from “difficult distributions” and use those samples to approximate expectations. In particular, we will look at transformation-based sampling, importance sampling, rejection sampling, and their popular variants. Finally, we will compare the performance of these sampling methods and apply the methodology, in a Bayesian analysis, to a dataset of interest to the student. We are happy to take two students if more than one person is interested in this project.
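As a flavor of the material, here is a hedged Python sketch of importance sampling (the target and proposal are chosen purely for illustration): estimating the small tail probability P(X > 3) for X ~ N(0, 1) by sampling where the rare event actually happens.

```python
import numpy as np

# Plain Monte Carlo wastes almost every draw on this rare event; instead,
# sample from the proposal N(3, 1), which concentrates near the event,
# and reweight each draw by the density ratio.
rng = np.random.default_rng(1)
n = 200_000
y = rng.normal(loc=3.0, scale=1.0, size=n)   # proposal draws
weights = np.exp(4.5 - 3.0 * y)              # phi(y) / phi(y - 3), simplified
estimate = float(np.mean((y > 3.0) * weights))
# True value: 1 - Phi(3), about 0.00135.
```

With the same number of draws, plain Monte Carlo would see only a few hundred tail events; the weighted estimate is far more stable.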

  • Anna Neufeld: Infectious disease modeling

    Student: Kayla Kenyon
    Slides

  • Prerequisites: Some comfort in R; experience with calculus and differential equations will be useful but not required.
  • We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
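The SIR equations described above fit in a few lines; this illustrative sketch (in Python rather than the R used in the project, with made-up parameter values) integrates dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I with Euler steps.

```python
# Euler integration of the deterministic SIR compartmental model.
def sir(beta, gamma, S0, I0, R0_init, days, dt=0.1):
    S, I, R = float(S0), float(I0), float(R0_init)
    N = S + I + R
    path = [(S, I, R)]
    for _ in range(int(round(days / dt))):
        new_inf = beta * S * I / N * dt   # new infections this step
        new_rec = gamma * I * dt          # new recoveries this step
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        path.append((S, I, R))
    return path

# Basic reproduction number beta/gamma = 3, so a large outbreak occurs.
path = sir(beta=0.3, gamma=0.1, S0=999, I0=1, R0_init=0, days=200)
final_S, final_I, final_R = path[-1]
```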
  • Michael Pearce: Nonlinear Regression

    Student: Muhammad Anas
    Slides, Writeup

  • Prerequisites: A basic knowledge of linear regression and some experience in R
  • Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
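To see why relaxing linearity matters, here is a small illustrative sketch (in Python rather than the R used in the project; the data are simulated): a cubic polynomial captures a nonlinear signal that a straight line cannot.

```python
import numpy as np

# Simulated data with a genuinely nonlinear signal plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)
y = x**3 - x + rng.normal(scale=0.2, size=x.size)

def rss(degree):
    """Residual sum of squares for a polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    return float(np.sum((y - np.polyval(coefs, x))**2))

rss_linear, rss_cubic = rss(1), rss(3)
# The straight line cannot follow the curve; the cubic leaves only noise.
```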
  • Taylor Okonek: Disease Mapping

    Student: Wuwei Zhang

  • Prerequisites: STAT 340; Interest in public health applications; Familiarity with R
  • Disease mapping is an important tool for visualizing spatial data on the prevalence and/or incidence of various diseases. In this project, we’ll discuss different types of spatial data, and explore visualization techniques and their usefulness in conveying relevant information. In particular, we’ll discuss ways to visualize uncertainty in disease mapping, how estimates underlying a disease map inform public health policy, issues with data sparsity and spatial aggregation, and how to obtain the estimates that underlie such maps. We’ll learn about Bayesian hierarchical models and, time permitting, spatial random effect terms. Throughout, we’ll explore concepts using real data from various diseases that are of student interest.
  • Sarah Teichman: Ethics of Algorithmic Decision Making

  • Prerequisites: None
  • In this project, we'll discuss ethical issues arising from the use of algorithms in decision making, in fields like medicine, policing, and housing. We'll talk about issues ranging from algorithmic bias and disparate impact to data privacy. Finally, we'll introduce statistical definitions of fairness and talk about their benefits and shortcomings. If there's interest, we can work on simulations/data analysis to evaluate statistical definitions of fairness.
  • Apara Venkat: Networks and Choice Modeling

    Student: Andrey Risukhin
    Slides, Writeup

  • Prerequisites: Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary. A general interest and curiosity about math and the world.
  • Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the "utility" of an item? How do you model the decisions of a customer when there are random effects? If we have data from decisions made by customers, can we identify the utility of various cereals? Discrete choice models attempt to answer these questions. In the language of networks, this problem is closely related to ranking. How does Google rank webpages? How do we rank players in sports tournaments? Recent developments have been unifying these fields. In this project, we will wrestle with these questions. First, we will learn about discrete choice models. Then, we will learn about ranking in networks. Finally, we will attempt to reconcile the two. Time permitting, we will also run simulations and look at datasets along the way.
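Ranking in networks has a famous concrete instance; here is a toy sketch (the 4-page graph is invented, and this is only one of the ranking ideas the project touches) of PageRank computed by power iteration.

```python
import numpy as np

# Toy 4-page web graph: page -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85                     # number of pages, damping factor

# Column-stochastic link matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):               # power iteration to the fixed point
    rank = (1 - d) / n + d * M @ rank
# Page 2 is linked to by every other page, so it ends up ranked highest.
```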
Winter 2021

    Peter Gao: Survey of Data Journalism

    Student: Rohini Mettu
    Slides, Writeup

  • Prerequisites: None
  • In this project, we'll take a look at recent uses of data and statistics in journalism and discuss their effectiveness in applying statistical methods and communicating results to their readers. If desired, we can focus on a specific area of application (climate change, economics, sports, epidemiology). If there is interest, we can look into replicating and extending a particular example.
  • Richard Guo: Making probability rigorous

    Student: Mark Lamin

  • Prerequisites: Probability theory at the level of MATH/STAT 394 and 395
  • If you have sat through an introductory probability course, you have likely heard things like "Lebesgue measure", "sigma algebra", "almost sure convergence", and even "martingale" mentioned. Do you wonder what they are and why they matter? This reading program will introduce these notions and make the probability you learned *rigorous*. We will read together the acclaimed monograph "Probability with Martingales" by David Williams. Rigorous treatment of probability and measure theory will prepare you for more advanced topics, such as stochastic processes, learning theory, and theoretical statistics.
  • Bryan Martin: Statistical Learning with Sparsity

    Student: Jerry Su

  • Prerequisites: Familiarity with regression, up to a STAT 311 level
  • Many modern applications benefit from the principle of less is more. Whether due to practical computation concerns from big data, overfitting concerns from too many parameters, or estimability concerns from a small sample size, statistical models often require sparsity. Sparsity can improve our predictions, help make the patterns we observe in our data more reproducible, and give our model parameters desirable properties. Often, sparsity is imposed through penalization, where we include a term in our model to enforce that some parameters are set equal to zero. We will learn about some of the statistical theory underlying how penalization works, and how it impacts our model output, both mathematically and computationally. We will also learn about and compare different sparsity schemes, such as lasso, group lasso, elastic net, and more. We will focus on understanding the different settings in which we might be interested in different forms of sparsity and apply these tools to real data.
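One way to see how an L1 penalty produces exact zeros is the orthonormal-design special case, where the lasso solution is simply the least-squares coefficients passed through a soft-thresholding operator. A toy sketch (the coefficients are invented for illustration):

```python
import numpy as np

# Soft-thresholding: shrink toward zero by lam, snapping small values to 0.
def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

ols = np.array([2.5, -0.3, 0.1, -1.8, 0.05])   # hypothetical OLS coefficients
lasso = soft_threshold(ols, lam=0.5)
# Coefficients smaller than lam in magnitude become exactly zero (sparsity);
# the large ones survive but are shrunk by lam.
```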
  • Eric Morenz and Yiqun Chen: See what's not there

    Student: Suh Young Choi
    Slides, Writeup

  • Prerequisites: Experience with linear regression, probability, or data manipulation will allow a deeper dive into the content, but it is not a requirement for students who are interested in the subject.
  • In this project, we will take a look at the concept of identification in the context of missing data (and causal inference, if time permits or there is interest!). While no glamorous artificial intelligence buzzwords are involved in the project per se, remember that your model is only as good as your data (and, as we will see by the end of the quarter, as good as your identification assumptions!). We will draw from various sources (e.g., Chapter 6 of Foundations of Agnostic Statistics) so that the schedule and materials stay flexible given your background and interests. We will consider a few empirical problems as well, from TidyTuesday datasets to political polls.
  • Taylor Okonek: Topics in Biostatistics

    Student: Anna Elias-Warren
    Slides, Writeup

  • Prerequisites: Introductory statistics a plus but not required, interest in public health applications
  • In this project, we’ll first broadly discuss some of the main pillars of the field of biostatistics, and then focus on a more specific topic for the remainder of the quarter. The main pillars we'll discuss include design of clinical trials, survival analysis, and infectious disease modeling. The focused part of this project can be greatly driven by student interest. Possible directions include: doing a deep-dive into the design of COVID-19 vaccine trials; reading articles about and discussing implications of communicating public health analyses to the public; gaining a broad understanding of how infectious disease models have influenced policy throughout the world; reading about ethical issues in global health studies; and more!
  • Michael Pearce: Nonlinear Regression

    Student: Alejandro Gonzalez
    Slides, Writeup

  • Prerequisites: A basic knowledge of linear regression and some experience in R
  • Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use James et al.'s "An Introduction to Statistical Learning" Chapter 7.
  • Sarah Teichman: Multivariate Data Analysis

    Student: Lindsey Gao
    Slides, Writeup

  • Prerequisites: Stat 311, and linear algebra would be helpful but not necessary
  • Almost all datasets collected across disciplines are multivariate, which means that multiple variables are measured. Recent technological advances have let researchers collect datasets with hundreds or thousands of variables. Methods from introductory statistics can be used to measure the relationship between a small subset of variables, but new methods are required to consider all of the data simultaneously. Multivariate data analysis is a set of tools to visualize, explore, and make inference about this type of data. In this project, we will use the textbook "An Introduction to Applied Multivariate Analysis with R" to learn about several methods for multivariate data analysis, including principal components analysis, multidimensional scaling, and clustering. We will choose a dataset of interest at the beginning and apply each of our methods to this dataset, leading to a final data analysis and comparison across methods.
  • Seth Temple: Statistical Genetics and Identity by Descent

    Student: Selma Chihab
    Slides, Writeup

  • Prerequisites: STAT 311; some programming experience preferred
  • We will explore statistical theory and methodology as they apply to the study of (human) heredity. The overarching themes of the readings are (1) computing measures of relatedness (kinship and inbreeding) and conditional trait (disease) risk based on known family trees, and (2) estimating relatedness given dense SNP or whole-genome sequence data. Readings will follow UW emeritus professor Elizabeth Thompson’s monograph “Statistical Inference from Genetic Data on Pedigrees”. We will cover conditional probabilities, likelihood models, Hardy-Weinberg equilibrium, the expectation-maximization algorithm to infer allele frequencies for the ABO blood group, Wright’s path counting formula, and identity by descent. During meetings we will work through practice exercises; for one or two meetings we will go through brief hands-on labs using current research software.
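The ABO expectation-maximization step mentioned above can be sketched compactly. This is an illustration with hypothetical phenotype counts, not material from the monograph: the E-step splits the ambiguous A and B phenotypes into expected genotype counts, and the M-step re-counts alleles.

```python
# Hypothetical phenotype counts for blood types A, B, AB, O.
nA, nB, nAB, nO = 44, 27, 4, 88
N = nA + nB + nAB + nO

p, q, r = 1/3, 1/3, 1/3          # frequencies of alleles A, B, O
for _ in range(200):
    # E-step: split ambiguous phenotypes into expected genotype counts.
    nAA = nA * p**2 / (p**2 + 2*p*r)     # A phenotype is AA or AO
    nAO = nA - nAA
    nBB = nB * q**2 / (q**2 + 2*q*r)     # B phenotype is BB or BO
    nBO = nB - nBB
    # M-step: re-estimate allele frequencies from the 2N observed alleles.
    p = (2*nAA + nAO + nAB) / (2 * N)
    q = (2*nBB + nBO + nAB) / (2 * N)
    r = (2*nO + nAO + nBO) / (2 * N)
```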
  • Apara Venkat: Networks and Choice Modeling

    Student: Xuling Yang
    Slides, Writeup

  • Prerequisites: Calculus (MATH 126) and exposure to probability theory (STAT 340). Linear Algebra (MATH 308) suggested, but not necessary. A general interest and curiosity about math and the world.
  • Imagine a grocery store that presents its customers with a multitude of cereal options. A rational customer would want to maximize their utility. How do you define the "utility" of an item? How do you model the decisions of a customer when there are random effects? If we have data from decisions made by customers, can we identify the utility of various cereals? Discrete choice models attempt to answer these questions. In the language of networks, this problem is closely related to ranking. How does Google rank webpages? How do we rank players in sports tournaments? Recent developments have been unifying these fields. In this project, we will wrestle with these questions. First, we will learn about discrete choice models. Then, we will learn about ranking in networks. Finally, we will attempt to reconcile the two. Time permitting, we will also run simulations and look at datasets along the way.
  • Jerry Wei: Topological Data Analysis

    Student: Joia Zhang
    Slides, Writeup

  • Prerequisites: Exposure to probability theory and linear algebra
  • Broadly, Topological Data Analysis (TDA) is about data analysis methods that find structure in data. This includes many topics; we will focus on clustering and nonlinear dimension reduction. We will study some textbook chapters and some classical papers. We may also go into mode estimation and manifold estimation if there is interest.
  • Kenny Zhang: Deep Learning for Computer Vision

    Student: Angela Zhao
    Slides, Writeup

  • Prerequisites: Proficiency in a programming language (preferably python). Some exposure in basic probability rules and computer science would be helpful.
  • Image data is prevalent in our lives, and modern deep learning provides a powerful set of tools for working with it. We will start with logistic regression and study what a neural network is. Then we will move on to convolutional neural networks and some coding exercises. If time allows, we can delve into the state-of-the-art Generative Adversarial Networks (GANs) and more complicated tasks like segmentation.
Autumn 2020

    Peter Gao: Statistics for Data Journalism: Election Forecasting

    Student: Andy Qin
    Slides

  • Prerequisites: Experience with introductory stats (at the level of any of the intro classes) would help.
  • In this project, we'll take a look at how leading newspapers and researchers conduct polls, forecast elections, and calculate polling averages. If there is interest, we can work on reverse engineering some of the methods used by publications such as FiveThirtyEight, RealClearPolitics, and the Upshot. Finally, we'll consider the ethics of forecasting elections and using statistics in general to study our election process.
  • Zhaoqi Li: Statistical Illusions

    Student: Yeji Sohn
    Slides

  • Prerequisites: Motivation to think about interesting problems and readiness for the brain to be teased. Some mathematical maturity would be beneficial.
  • Do you know that there is a “statistically significant” relationship between your salary and if you pee at night? Do you know that you will always wait longer than others at a bus stop? Do you know that a lot of the statistical concepts you learned in class actually don’t make sense? In this quarter-long study, we will dive into some common misconceptions about statistics and the questions of how to interpret statistics. We will touch on a wide range of statistical topics from a paradoxical view and learn the intuition behind them. No prior knowledge of statistics is required but motivation is encouraged.
  • Shane Lubold: Random Network Models

    Student: Peter Liu
    Slides

  • Prerequisites: Intro statistics and some programming experience (R or Python).
  • Network data, which consists of edges or relationships between nodes, is an important type of data. Many statistical models have been proposed to understand and model this type of data. Some are simple models which assume that all actors form connections with the same probability, while others are more complicated and use node-specific characteristics to determine the probability of an edge. In this project we will first review common network models (such as the Erdos-Renyi model, stochastic block model, and latent space model) and discuss why they might be useful in practice. We will then fit these models to data sets from the Stanford Network Analysis Project and discuss why some models fit better than others. The goal of the project is to understand how network data can arise in the real world and how network properties determine which models are reasonable. If we have time, we can also look at dynamic networks (networks that change over time) and see if we can model them using any of the models discussed above.
  • Bryan Martin: Ethics in Data Science and Statistics

    Student: Jinghua Sun
    Slides

  • Prerequisites: None
  • In this project, we will discuss ethical questions and issues that arise in the field of statistics and data science. We will read case studies and work together to develop a lesson that can be taught to introductory statistics students as part of an undergraduate curriculum. By the end of this project, I hope to have material that I will use in my own courses! Topics will be driven by the student's particular interest, but possible topics will include: the history of statistics and eugenics, race and gender in data science, algorithmic fairness, reproducibility and open science, data transparency, privacy, and human subjects data.
  • Ronak Mehta: The Magical Properties of the SVD

    Student: Claire Gao

  • Prerequisites: Linear Algebra (Math 308 or equivalent). Some statistical background, preferably at the level of 340.
  • The singular value decomposition (SVD) of a matrix has wide relevance in virtually all areas of applied mathematics. This project will consist of three parts: 1) a theory section containing proofs of intriguing properties of the SVD and derivations of three problems of wide importance in statistics and machine learning, principal components analysis (PCA), partial least squares (PLS), and canonical correlations analysis (CCA), all of whose solutions depend heavily on the SVD; 2) a simulation section demonstrating the bias-variance tradeoff of using one method over another for various regression/classification tasks; and 3) a real-data section in which the student interprets the features learned by these methods on a dataset of their choice. This project is ideal for intermediate statistics students who want to make their linear algebra skills airtight and build a strong mathematical foundation for future success in machine learning.
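The SVD-PCA connection at the heart of the theory section can be demonstrated in a few lines; a hedged sketch on simulated data (my illustration, not the project's material): the right singular vectors of centered data are the principal components, and the squared singular values give their variances.

```python
import numpy as np

# Anisotropic toy data: three variables with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)                       # PCA requires centered data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_vars = s**2 / (Xc.shape[0] - 1)           # variances of the components

# Cross-check: these match the eigenvalues of the sample covariance matrix.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc.T)))[::-1]
```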
  • Anna Neufeld: Infectious Disease Modeling

    Student: Harper Zhu
    Slides, Shiny App

  • Prerequisites: Some comfort in R; experience with calculus and differential equations will be useful but not required.
  • We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest and statistical level of the student, but could potentially culminate in an applied COVID-19 modeling project.
  • Michael Pearce: History and Practice of Data Communication

    Student: Ziyi Li
    Writeup

  • Prerequisites: None; some experience with R or Python may be helpful but is not required.
  • In this course, we'll learn about the development of data communication techniques and their modern use. We'll begin by studying how people have visualized patterns in data over time, and consider how those methods reflected the computational resources available in each era. Then, we'll shift our attention to modern issues in data communication, drawing examples from the COVID-19 pandemic and 2020 US presidential election: How do practitioners effectively show complex relationships or model uncertainty? How do people mislead readers through text and figures (intentionally or otherwise)? What common pitfalls exist, and how can we avoid them? We'll finish with a data communication project based on the student's interests.
  • Subodh Selukar: Introduction to Survival Analysis

    Student: Howard Baek
    Writeup, Shiny App

  • Prerequisites: Familiarity with R; familiarity with survival analysis
  • In many applications, researchers are interested in the time it takes for an outcome of interest to occur: for example, time to death by any cause ("overall survival") is the gold standard outcome for studies in many biomedical fields. Among other characteristics, these data exhibit a special kind of missingness termed "censoring," which requires the use of different statistical methods than other data types. In this project, the student will learn about the characteristics of time-to-event (or "survival") data and basic methods for approaching these data.
  • Sarah Teichman: Phylogenetic Trees

    Student: Lexi Xia
    Writeup

  • Prerequisites: An intro stats class. Some R experience is useful but not required.
  • In this project, we will learn about the application of statistics to evolutionary biology through working with phylogenetic trees. In evolutionary biology, a diagram in the form of a tree is often used to represent the diversification of species over time. We'll read chapters from the book "Tree Thinking" and choose a dataset to investigate deeply in R in order to understand phylogenies: what they are, how they are used, and how we can use statistics to construct them and test hypotheses about evolution.
  • Seth Temple: Statistical Genetics and Identity by Descent

    Student: Rachel Ferina
    Slides, Writeup

  • Prerequisites: None; keen interest in the biological sciences
  • We will explore many classical ways in which statistics has been employed to study heredity in humans and other organisms. For example, we will introduce the expectation-maximization algorithm to infer allele frequencies for ABO blood types and discuss Jacquard’s 9 condensed states of identity by descent. This tutorial will be very practical as we will draw many family trees to compute kinship and inbreeding coefficients. We will use UW emeritus professor Elizabeth Thompson’s monograph "Statistical Inference from Genetic Data on Pedigrees" as reading material. Depending on student interest, we may read more chapters from Thompson’s book, investigate the history of statistical genetics as it relates to the eugenics movement, or code up some computations like the path counting formula.
Spring 2020

    We ran a limited number of projects due to COVID-19.

    Sheridan Grant: Causal Inference: DAGs and Potential Outcomes

    Student: Grace Shen
    Slides

  • Prerequisites: Familiarity with linear regression and facility with Gaussian distributions (preferably multivariate)
  • This project will be reading-focused, rather than data-analysis-focused. It's intended for a junior or senior student who is interested in learning about causal inference, a huge topic in graduate-level statistics and statistics research, perhaps as a prelude to applying for PhDs. You'll read classic papers and parts of textbooks on two approaches to causal inference: potential outcomes and graphs. For the final presentation, you'll contrast the two approaches as applied to a problem (practical or theoretical) of your choice.
  • Shane Lubold: Random Graphs

    Student: Gordon An

  • Prerequisites: Some exposure to probability. Some exposure to, or an interest in, graph theory.
  • In this project, we will study random graph theory and how the behavior of these graphs changes as the size of the graph grows. We will focus primarily on a simple graph model with a number of interesting properties, the Erdős–Rényi model. In this model, we generate a graph on n nodes, where each node connects to any other node with probability p(n), which can depend on the graph size n. We will use theory and simulations to derive key properties of this model, such as the distribution of the degree of a vertex or the number of cliques of any size. We will also explore other exciting properties of this model. For example, if the product p*n grows at a certain rate as n gets big, then the graph will exhibit large cliques with probability 1. The proofs of these ideas use only basic statistical ideas. We will prove the conditions that guarantee this behavior and use simulations to explore how large the graphs must be to see it. This project will expose students to the exciting field of random graphs and will give them a good understanding of how simple statistical tools can answer complex questions.
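The simulation side of this can start very simply; an illustrative Python sketch (parameter values invented) draws one Erdős–Rényi graph G(n, p) and checks the average degree against its theoretical value (n - 1) * p.

```python
import numpy as np

# One draw from G(n, p): flip a coin for each unordered pair of nodes.
rng = np.random.default_rng(42)
n, p = 2000, 0.01
coin_flips = rng.random((n, n)) < p
adj = np.triu(coin_flips, k=1)       # keep each pair once, no self-loops
adj = adj | adj.T                    # symmetrize into an undirected graph
mean_degree = adj.sum(axis=1).mean()
# Theory: each node has n - 1 potential neighbors, each present with
# probability p, so the expected degree is (n - 1) * p = 19.99 here.
```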
  • Anna Neufeld: Disease Modeling

    Student: Rachael Ren
    Writeup, Slides

  • Prerequisites: Knowledge of R will be useful!
  • We will start by reading introductory material on SIR compartmental models for disease modeling, and will work to implement these models in R. These are deterministic differential equation models whose output depends on knowledge of various input parameters. After becoming comfortable with the models, we will discuss how statisticians estimate the parameters of these models using current outbreak data in the face of uncertainty, and how the models are then used for predictions and forecasting. The project will evolve based on the interest of the student and relevant current events.
Winter 2020

    Peter Gao: Introduction to Gaussian Processes

    Student: Hannah Chang

  • Prerequisites: None; interest in programming encouraged
  • As a concept, the Gaussian distribution, often referred to as the normal distribution or the bell curve, has cemented itself in the public consciousness. But what about its finite-dimensional generalization, the multivariate Gaussian? Or its infinite-dimensional counterpart, the Gaussian process? This project has two main aims: first, to discuss and explore how Gaussian processes arise in various subfields like machine learning and spatial statistics; and second, to develop notes (or a website) that explain Gaussian processes to a general audience. Of course, the exact focus of this project is flexible, based on the reader's interests/background.
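A hedged sketch of the core idea (kernel and grid invented for illustration): a Gaussian process assigns any finite set of inputs a multivariate Gaussian whose covariance comes from a kernel, here the squared-exponential k(x, x') = exp(-(x - x')^2 / (2 ell^2)).

```python
import numpy as np

# Squared-exponential (RBF) kernel matrix over a grid of inputs.
def sq_exp_kernel(xs, ell=0.5):
    d = xs[:, None] - xs[None, :]
    return np.exp(-d**2 / (2 * ell**2))

xs = np.linspace(0.0, 1.0, 50)
K = sq_exp_kernel(xs)

# One sample path from the GP prior; the tiny jitter keeps K numerically
# positive semidefinite for the multivariate-normal sampler.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(xs.size), K + 1e-8 * np.eye(xs.size))
```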
  • Kristof Glauninger: Nonparametric Regression

    Student: Eli Grosman
    Writeup

  • Prerequisites: Familiarity with linear regression and basic probability, comfort with algebra, some calculus
  • Nonparametric statistical methods have seen an explosion in popularity as datasets have increased in size and complexity. The goal of this project will be to introduce students who are familiar with parametric regression models to a nonparametric setting. We will explore some of the basic theory and applications of these models, as well as an interesting case where we can achieve parametric convergence rates in a nonparametric setting.
  • Zhaoqi Li: Statistical Machine Learning and Data Analysis

    Student: Zhijun Peng
    Writeup

  • Prerequisites: Knowledge of probability theory and maximum likelihood estimation at the level of Stat 340 is preferred; some familiarity with basic programming is preferred; an enthusiasm for reading and experimenting is encouraged.
  • We will discuss the relationship between statistics and machine learning, one of the most popular fields in the world, and how statistical techniques can be used in the machine learning framework. Topics may include classifiers (e.g., decision trees, naive Bayes) and training (what training is and its relation to likelihood inference). The design could range from experimental to theoretical, depending on the background of the student.
  • Shane Lubold: Random Graphs

    Student: Tahmin Talukder

  • Prerequisites: Some exposure to probability. Some exposure to, or an interest in, graph theory.
  • In this project, we will study random graph theory and how the behavior of these graphs changes as the size of the graph grows. We will focus primarily on a simple graph model with a number of interesting properties, the Erdős–Rényi model. In this model, we generate a graph on n nodes, where each node connects to any other node with probability p(n), which can depend on the graph size n. We will use theory and simulations to derive key properties of this model, such as the distribution of the degree of a vertex or the number of cliques of any size. We will also explore other exciting properties of this model. For example, if the product p*n grows at a certain rate as n gets big, then the graph will exhibit large cliques with probability 1. The proofs of these ideas use only basic statistical ideas. We will prove the conditions that guarantee this behavior and use simulations to explore how large the graphs must be to see it. This project will expose students to the exciting field of random graphs and will give them a good understanding of how simple statistical tools can answer complex questions.
  • Bryan Martin: R Package Development

    Student: Thomas Serrano
    Writeup

  • Prerequisites: Familiarity with R
  • Reproducible statistical analysis depends on good software and coding practices. In this project, we will learn how to go from users of R packages to developers of R packages. We will also practice and implement general software developer skills, including documentation, version control, and unit testing. We will learn how to make our code robust, efficient, and user-friendly. Ideally, you will start with an idea of something you are interested in implementing as an R package, whether it be a statistical model, data analysis application, or anything else, though this is not required!
  • Anna Neufeld: Statistical Natural Language Processing

    Student: Christina Nick

  • Prerequisites: Proficiency in a programming language. Knowledge of basic probability rules at the level of Stat 311.
  • In most statistics classes, the data you work with are numbers. Text documents such as books, articles, and speeches provide massive sources of data that cannot be analyzed using the tools from your introductory statistics courses. We will explore the field of statistical natural language processing and discuss classification and clustering techniques for text data. Applications of such techniques include translation, information retrieval, fake news detection, and sentiment analysis. After reviewing the literature to get a sense of the general techniques in NLP, we will select a particular text dataset and research question and work on an applied project.
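For a flavor of text classification, here is a minimal multinomial naive Bayes sketch on a toy spam/ham example (Python, standard library only; the function names and tiny dataset are invented for illustration). It turns each document into word counts and scores classes by log P(class) plus the summed word log-likelihoods, with Laplace smoothing:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial naive Bayes text classifier.
    docs: list of (label, text) pairs. Returns class log-priors,
    per-class word log-likelihoods (Laplace-smoothed), and the vocabulary."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)          # label -> word -> count
    vocab = set()
    for label, text in docs:
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    log_prior = {c: math.log(k / len(docs)) for c, k in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_like, vocab

def classify(text, log_prior, log_like, vocab):
    """Return the label maximizing log P(class) + sum of word log-likelihoods."""
    scores = {c: log_prior[c] + sum(log_like[c][w]
                                    for w in text.lower().split() if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

docs = [("spam", "win money now"), ("spam", "free money offer"),
        ("ham", "meeting agenda attached"), ("ham", "lunch meeting today")]
model = train_nb(docs)
print(classify("free money", *model))   # "spam"
```

Real NLP pipelines add tokenization, stop-word handling, and much larger corpora, but the bag-of-words scoring idea is the same.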
  • Michael Pearce: Nonlinear Regression

    Student: Oliver Bejar Tjalve
    Writeup

  • Prerequisites: A basic knowledge of linear regression and some experience in R
  • Simple linear regression models can be easy to implement and interpret, but they don't always fit data well! For this project, we'll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we'll even see how to validate such models using cross-validation! We will mostly use Chapter 7 of James et al.'s "An Introduction to Statistical Learning."
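Polynomial regression, the simplest of these relaxations, is still a linear model in the coefficients: you regress y on x, x², and so on. The sketch below (Python rather than R, with a hypothetical `polyfit` helper; in R one would just use `lm()` with `poly()`) solves the least-squares normal equations by hand for a dataset a straight line cannot capture:

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit: build the design matrix of powers of x,
    form the normal equations (X'X) b = X'y, and solve them by Gaussian
    elimination. Returns coefficients [b0, b1, ..., b_degree]."""
    k = degree + 1
    X = [[x ** j for j in range(k)] for x in xs]
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)] for r in range(k)]
    b = [sum(X[i][r] * ys[i] for i in range(len(xs))) for r in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in reversed(range(k)):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, k))) / A[r][r]
    return coef

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 9, 19, 33]            # exactly y = 1 + 2x^2
coef = polyfit(xs, ys, 2)
print(coef)                       # coefficients close to [1, 0, 2]
```

Cross-validation then enters when choosing the degree: fit on part of the data, score on the rest, and pick the degree with the best held-out error.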
  • Anupreet Porwal: Bayesian Linear regression and applications

    Student: Yuchen Sun
    Writeup

  • Prerequisites: Basic knowledge of probability distributions at the level of Stat 394 or Stat 340. Knowledge of Linear Algebra is essential for this project. Familiarity with a programming language may be helpful.
  • Often when we fit models in practical applications, we have some prior understanding of the context of the problem or field, which could be useful for tuning our model along with the data. For example, if you are trying to model the reply times of emails from the department chair to professors, information about the designation of the professors (full-time/assistant) can be helpful. Bayesian statistics provides a formal way to incorporate our prior beliefs and information into the model, and it is particularly useful because it helps accurately quantify the uncertainty in our inferences. In this project, we will discuss the basics of Bayes' theorem and the Bayesian version of linear regression, and, if time permits, we will learn about probabilistic matrix factorization (recommendation systems) and apply these techniques to an interesting problem.
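The mechanics are easiest to see in the simplest conjugate case: a no-intercept line y = beta*x with known noise variance sigma² and a normal prior beta ~ N(0, tau²). The posterior is again normal, with mean shrunk from the least-squares estimate toward the prior mean 0. A minimal sketch (Python; the `posterior_slope` helper and the toy data are invented for illustration):

```python
def posterior_slope(xs, ys, sigma2, tau2):
    """Conjugate Bayesian update for y = beta * x + noise,
    noise ~ N(0, sigma2), prior beta ~ N(0, tau2).
    Posterior precision adds the data precision to the prior precision;
    the posterior mean is the precision-weighted combination."""
    precision = sum(x * x for x in xs) / sigma2 + 1.0 / tau2
    mean = (sum(x * y for x, y in zip(xs, ys)) / sigma2) / precision
    return mean, 1.0 / precision          # posterior mean and variance

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]                 # roughly y = 2x
mean, var = posterior_slope(xs, ys, sigma2=1.0, tau2=10.0)
print(mean, var)   # mean near 2, pulled slightly toward the prior mean 0
```

The posterior variance quantifies the remaining uncertainty about beta, which is exactly the kind of uncertainty statement the project description emphasizes.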
  • Sarah Teichman: Networks

    Student: Josiah Thulin
    Writeup

  • Prerequisites: Stat 311. Some R is useful but not required.
  • Most of the data that you see in STAT 311 are assumed to be independent. However, a lot of interesting datasets include information about individual observations and the relationships between them. This type of data can be analyzed as networks, in which nodes represent individuals and edges represent relationships between them. Networks can be used to study interactions between social groups, the spread of contagious diseases, biological cycles, etc. We will use the textbook "Statistical Analysis of Network Data," along with its companion text "Statistical Analysis of Network Data in R," by Eric D. Kolaczyk. We will additionally read one or two papers about an application of network analysis and/or analyze a small network in R (based on the interest of the student).
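The node-and-edge representation can be made concrete in a few lines. This sketch (Python rather than the R used in the companion text, with a made-up toy relationship list) builds an adjacency list, computes each individual's degree, and finds who is reachable from whom by breadth-first search:

```python
from collections import deque

# Toy relationship data: each pair is an undirected tie between individuals.
edges = [("ann", "bob"), ("bob", "cara"), ("cara", "ann"), ("dan", "eve")]

# Adjacency list: node -> set of neighbors.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Degree of a node = number of relationships it participates in.
degrees = {v: len(nbrs) for v, nbrs in adj.items()}

def component(start):
    """Breadth-first search: everyone reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u] - seen:
            seen.add(w)
            queue.append(w)
    return seen

print(degrees["bob"])            # 2
print(sorted(component("ann")))  # ['ann', 'bob', 'cara']
```

Degree distributions and connected components are among the first summaries computed in the Kolaczyk texts, so this is a natural warm-up before tackling a real network in R.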