Winter 2020 Projects

Anna Neufeld: Statistical Natural Language Processing

Student: Christina Nick

Slides | Writeup

Prerequisites: Proficiency in a programming language. Knowledge of basic probability rules at the level of Stat 311.

In most statistics classes, the data you work with are numbers. Text documents such as books, articles, and speeches provide massive sources of data that cannot be analyzed using the tools from your introductory statistics courses. We will explore the field of statistical natural language processing and discuss classification and clustering techniques for text data. Applications of such techniques include translation, information retrieval, fake news detection, and sentiment analysis. After reviewing the literature to get a sense of the general techniques in NLP, we will select a particular text dataset and research question and work on an applied project.

Anupreet Porwal: Bayesian Linear Regression and Applications

Student: Yuchen Sun

Slides | Writeup

Prerequisites: Basic knowledge of probability distributions at the level of Stat 394 or Stat 340. Knowledge of Linear Algebra is essential for this project. Familiarity with a programming language may be helpful.

Often when we fit models in practical applications, we have some prior understanding of the context of the problem or field that could be useful for tuning our model alongside the data. For example, if you are trying to model the reply times of emails from the department chair to professors, information about the designation of professors (full-time/assistant) can be helpful. Bayesian statistics provides a formal way to incorporate our prior beliefs and information into the model and is particularly useful because it helps to accurately quantify the uncertainty in our inferences. In this project, we will discuss the basics of Bayes' theorem and the Bayesian version of linear regression and, if time permits, we will learn about probabilistic matrix factorization (recommendation systems) and apply these techniques to an interesting problem.

Bryan Martin: R Package Development

Student: Thomas Serrano

Slides | Writeup

Prerequisites: Familiarity with R

Reproducible statistical analysis depends on good software and coding practices. In this project, we will learn how to go from users of R packages to developers of R packages. We will also practice and implement general software developer skills, including documentation, version control, and unit testing. We will learn how to make our code robust, efficient, and user-friendly. Ideally, you will start with an idea of something you are interested in implementing as an R package, whether it be a statistical model, data analysis application, or anything else, though this is not required!

Kristof Glauninger: Nonparametric Regression

Student: Eli Grosman

Slides | Writeup

Prerequisites: Familiarity with linear regression and basic probability, comfort with algebra, some calculus

Nonparametric statistical methods have seen an explosion in popularity as datasets have increased in size and complexity. The goal of this project will be to introduce students who are familiar with parametric regression models to a nonparametric setting. We will explore some of the basic theory and applications of these models, as well as an interesting case where we can achieve parametric convergence rates in a nonparametric setting.

Michael Pearce: Nonlinear Regression

Student: Oliver Bejar Tjalve

Slides | Writeup

Prerequisites: A basic knowledge of linear regression and some experience in R

Simple linear regression models can be easy to implement and interpret, but they don’t always fit data well! For this project, we’ll explore regression methods that relax the assumption of linearity. These might include (based on the interest and/or experience level of the student) polynomial regression, step functions, regression splines, smoothing splines, multivariate adaptive regression splines, and generalized additive models. Hopefully, we’ll even see how to validate such models using cross-validation! We will mostly use James et al.’s “An Introduction to Statistical Learning” Chapter 7.

Peter Gao: Introduction to Gaussian Processes

Student: Hannah Chang

Slides | Writeup

Prerequisites: None; interest in programming encouraged

As a concept, the Gaussian distribution, often referred to as the normal distribution or the bell curve, has cemented itself in the public consciousness. But what about its finite-dimensional generalization, the multivariate Gaussian? Or its infinite-dimensional counterpart, the Gaussian process? This project has two main aims: first, to discuss and explore how Gaussian processes arise in various subfields like machine learning and spatial statistics; and second, to develop notes (or a website) that explain Gaussian processes to a general audience. Of course, the exact focus of this project is flexible, based on the reader’s interests/background.

Sarah Teichman: Networks

Student: Josiah Thulin

Slides | Writeup

Prerequisites: Stat 311. Some R is useful but not required.

Most of the data that you see in STAT 311 are assumed to be independent. However, a lot of interesting datasets include information about individual observations and the relationships between them. This type of data can be analyzed as networks, in which nodes represent individuals and edges represent relationships between them. Networks can be used to study interactions between social groups, the spread of contagious diseases, biological cycles, etc. We will use the textbook “Statistical Analysis of Network Data,” along with its companion text “Statistical Analysis of Network Data in R” by Eric D. Kolaczyk. We will additionally read one or two papers about an application of network analysis and/or analyze a small network in R (based on the interest of the student).

Shane Lubold: Random Graphs

Student: Tahmin Talukder

Slides | Writeup

Prerequisites: Some exposure to probability. Some exposure to, or an interest in, graph theory.

In this project, we will study random graph theory and how the behavior of these graphs changes as the size of the graph grows. We will focus primarily on a simple graph model with a number of interesting properties, the Erdős-Rényi model. In this model, we generate a graph on n nodes, where each node connects to any other node independently with probability p(n), which can depend on the graph size n. We will use theory and simulations to derive key properties of this model, such as the distribution of the degree of a vertex or the number of cliques of any size. We will also explore other exciting properties of this model. For example, if the product p*n grows at a certain rate as n gets big, then the graph will exhibit large cliques with probability 1. The proofs of these ideas use only basic statistical ideas. We will prove the conditions that guarantee this behavior and use simulations to explore how large the graphs must be to see this behavior. This project will expose students to the exciting field of random graphs and will give them a good understanding of how simple statistical tools can answer complex questions.
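
As a rough illustration of the kind of simulation described above, here is a minimal Python sketch that draws one graph from the model and compares two simulated quantities with their theoretical expectations. It is not taken from the project materials: the function names (sample_er_graph, degrees, count_triangles) and the values n = 200, p = 0.05 are assumptions chosen for illustration. It relies on two standard facts about the model: the degree of a vertex follows a Binomial(n - 1, p) distribution, so its mean is (n - 1)p, and the expected number of triangles (cliques of size 3) is C(n, 3) p^3.

```python
import random
from itertools import combinations


def sample_er_graph(n, p, rng):
    """Draw one graph on n nodes; each pair is joined independently with probability p."""
    adj = [set() for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degrees(adj):
    """Degree of every vertex, read off the adjacency sets."""
    return [len(neighbors) for neighbors in adj]


def count_triangles(adj):
    """Count cliques of size 3, visiting each triangle once as u < v < w."""
    total = 0
    for u in range(len(adj)):
        for v in adj[u]:
            if v > u:
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


# Illustrative values only (assumptions, not from the project description).
rng = random.Random(0)
n, p = 200, 0.05

adj = sample_er_graph(n, p, rng)
degs = degrees(adj)

mean_degree = sum(degs) / n
expected_degree = (n - 1) * p  # mean of Binomial(n - 1, p)

triangles = count_triangles(adj)
expected_triangles = n * (n - 1) * (n - 2) / 6 * p ** 3  # C(n, 3) * p^3

print(f"mean degree: simulated {mean_degree:.2f}, expected {expected_degree:.2f}")
print(f"triangles:   simulated {triangles}, expected {expected_triangles:.1f}")
```

Repeating the draw for increasing n, with p replaced by a function of n, is one way to explore how large the graph must be before the limiting behavior becomes visible.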

Zhaoqi Li: Statistical Machine Learning and Data Analysis

Student: Zhijun Peng

Slides | Writeup

Prerequisites: Knowledge of probability theory and Maximum Likelihood Estimation at the level of Stat 340 is preferred; some familiarity with basic programming is preferred; an enthusiasm for reading and experimenting is encouraged.

We will discuss the relationship between statistics and machine learning, one of the most popular fields in the world, and how statistical techniques could be used in the machine learning framework. Topics may include classifiers (e.g., Decision Tree, Naive Bayes), training (what is training and the relation to likelihood inference), etc. The design could range from experimental to theoretical, depending on the background of the student.