Statistical Language Models in R

Instructor: Dr. Dave Campbell

July 14, 2023

About the Short Course: 1 Day Course

Analysis of numeric and categorical data has unlocked incredible insights over the past century. However, tools for these data types are not amenable to the contents of legal documents, news briefings, emails, product descriptions, or social media posts. Text data requires a different set of tools to extract descriptive insights and test hypotheses. This workshop will use R and the tidyverse to showcase statistical language modelling tools through hands-on tutorials outlining common use cases. Extensions and more complex modelling approaches will be outlined along with their costs, risks, and potentially more meaningful insights. The first half of the workshop will focus on sentiment analysis: producing descriptive statistics and performing hypothesis tests. The second half will focus on clustering documents and estimating covariate effects on the use of text.

In-Person Event. Location Of Short Courses: University of Ottawa

Who is this course for?

Graduate students and up in statistics and machine learning with at least moderate coding experience. The course targets data scientists and aspiring data scientists, pushing them to work across the fields of statistics and computer science.

Level Of Instruction: Beginner

Learning Outcomes

Participants will
- gain experience with cleaning text data
- understand the relationship between preprocessing and analysis
- perform sentiment analysis, converting text to numeric information, and perform statistical tests on the results
- understand the complexity and risk hierarchy of sentiment tools, including potential biases in text models and when they are or aren’t problematic
- gain experience using word embedding models and assessing their quality
- perform document clustering and gain insight into how covariates can be incorporated.

Course Materials

Materials will include slides and code. All will be made available on GitHub. Slides and code are built using markdown.

Delivery Structure

All participants will be encouraged to run code live in class. The workshop is a conversational mix of running code, discussing strategies, and making collective decisions. The course will have several checkpoints so that the amount and depth of content can be customized to the audience.

Knowledge Assumed

Familiarity with R and the tidyverse will be assumed, but strong Python skills are a suitable alternative.

Preparatory Material

Bring a computer with RStudio (version >2021.09.0) and R (version >4.1) installed, and be prepared to run the analysis live during the workshop. Code and datasets will be provided for the hands-on workshop.

About the instructor: Dr. Dave Campbell

Dr. Dave Campbell is a computational statistics methodologist with a penchant for collaboration and a strong interest in ensuring student employability. He is the Assistant Director of Data Science Applications at the Bank of Canada and also a full Professor in the School of Mathematics and Statistics and the School of Computer Science at Carleton University. Before moving to Ottawa in 2019, Dave was an Associate Professor at Simon Fraser University in the Department of Statistics and Actuarial Science, where he led the creation of their BSc in Data Science. He was the inaugural President of the Data Science and Analytics Section of the Statistical Society of Canada and a co-organizer of the Vancouver Learn Data Science Meetup (>5000 members).

Dr. Campbell researches inferential algorithms at the intersections of statistics with machine learning, computing, and applied mathematics to solve problems in economics, forensic science, environmental toxicology, paleo-climatology, psychology, and more. He has co-authored discussion papers in Bayesian Analysis and the Journal of the Royal Statistical Society (Series B) and has been awarded over $3.5 million in research grants. His recent projects include estimating the economic impact of extreme climate events, identifying orcas from underwater acoustic hydrophones and predicting when they will cross into shipping lanes, improving uncertainty quantification methods for nuclear magnetic resonance, and developing inferential statistical language processing tools for quantifying covariate effects in language. At Carleton, he supervises a team of grad students and post-doctoral fellows. At the Bank of Canada, he is often hiring new recruits with strong computing and statistics skills.

Affiliations: Bank of Canada, Carleton University