Welcome

These are the materials for workshops on text analysis by Julia Silge. Text data is increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. In this workshop, learn how to manipulate, summarize, and visualize the characteristics of text using these methods and R packages from the tidy tool ecosystem. These tools work well for many analytical questions and let analysts integrate natural language processing into workflows already in wide use. Explore how to implement approaches such as sentiment analysis, tf-idf, network analysis of words, and both supervised and unsupervised text models.

Is this workshop for me?

This workshop is appropriate for you if you answer yes to these questions:

  • Have you ever encountered text data and suspected there was useful insight latent within it but felt frustrated about how to find that insight?
  • Are you familiar with dplyr and ggplot2, and ready to learn how unstructured text data can be analyzed within the tidyverse ecosystem?
  • Do you need a flexible framework for handling text data that allows you to engage in tasks from exploratory data analysis to supervised predictive modeling?

Learning objectives

By the end of this workshop, participants will be able to:

  • Perform exploratory data analyses of text datasets, including summarization and data visualization
  • Understand and implement both sentiment analysis and tf-idf
  • Use unsupervised models to gain insight into text data
  • Build supervised classification models for text using tidy data principles

Preparation

Please join the workshop with a computer that has the following R packages installed (all available for free). You can install them from the R console:

install.packages(c("tidyverse", "tidytext", 
                   "gutenbergr", "widyr",
                   "stopwords", "stm",
                   "tidygraph", "ggraph",
                   "tidymodels", "glmnet", 
                   "vip", "textrecipes"))

Slides

Code

Quarto files for working along are available on GitHub.

Past workshops

Instructor bio

Julia Silge is a data scientist and software engineer at Posit PBC (formerly RStudio) where she works on open source modeling and MLOps tools. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.