class: center, middle, inverse, title-slide

# Text analysis 📃
---

layout: true

<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle

# Tidytext analysis

---

## Packages

In addition to `tidyverse`, we will be using four other packages today:

```r
library(tidytext)
library(genius)
library(wordcloud)
library(DT)
```

---

## Tidytext

- Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use.
- Learn more at https://www.tidytextmining.com/.

---

## What is tidy text?

```r
text <- c("Take me out tonight",
          "Where there's music and there's people",
          "And they're young and alive",
          "Driving in your car",
          "I never never want to go home",
          "Because I haven't got one",
          "Anymore")

text
```

```
## [1] "Take me out tonight"                   
## [2] "Where there's music and there's people"
## [3] "And they're young and alive"           
## [4] "Driving in your car"                   
## [5] "I never never want to go home"         
## [6] "Because I haven't got one"             
## [7] "Anymore"
```

---

## What is tidy text?

```r
text_df <- tibble(line = 1:7, text = text)

text_df
```

```
## # A tibble: 7 x 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Take me out tonight                   
## 2     2 Where there's music and there's people
## 3     3 And they're young and alive           
## 4     4 Driving in your car                   
## 5     5 I never never want to go home         
## 6     6 Because I haven't got one             
## # … with 1 more row
```

---

## What is tidy text?

```r
text_df %>%
  unnest_tokens(word, text)
```

```
## # A tibble: 32 x 2
##    line word   
##   <int> <chr>  
## 1     1 take   
## 2     1 me     
## 3     1 out    
## 4     1 tonight
## 5     2 where  
## 6     2 there's
## # … with 26 more rows
```

---

class: middle

# What are you listening to?

---

## From the "Getting to know you" survey

> "What are your 3 - 5 most favorite songs right now?"

.midi[
```r
listening <- read_csv("data/listening.csv")

listening
```

```
## # A tibble: 104 x 1
##   songs                                                          
##   <chr>                                                          
## 1 Gamma Knife - King Gizzard and the Lizard Wizard; Self Immolat…
## 2 I dont listen to much music                                    
## 3 Mess by Ed Sheeran, Take me back to london by Ed Sheeran and S…
## 4 Hate Me (Sometimes) - Stand Atlantic; Edge of Seventeen - Stev…
## 5 whistle, gogobebe, sassy me                                    
## 6 Shofukan, Think twice, Padiddle                                
## # … with 98 more rows
```
]

---

## Looking for commonalities

.midi[
```r
listening %>%
  unnest_tokens(word, songs) %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 786 x 2
##   word      n
##   <chr> <int>
## 1 the      56
## 2 by       23
## 3 to       20
## 4 and      19
## 5 i        19
## 6 you      15
## # … with 780 more rows
```
]

---

## Stop words

- In computing, stop words are words that are filtered out before or after processing natural language data (text).
- They usually refer to the most common words in a language, but there is no single list of stop words shared by all natural language processing tools.
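Since stop word lists are regular data frames, one way to drop them is with `anti_join()` (a minimal sketch using the `text_df` from earlier, anticipating the worked example later in the deck):

```r
# Keep only the tokens that do not appear in the default (snowball) stop word list
text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word")
```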
---

## English stop words

```r
get_stopwords()
```

```
## # A tibble: 175 x 2
##   word   lexicon 
##   <chr>  <chr>   
## 1 i      snowball
## 2 me     snowball
## 3 my     snowball
## 4 myself snowball
## 5 we     snowball
## 6 our    snowball
## # … with 169 more rows
```

---

## Spanish stop words

```r
get_stopwords(language = "es")
```

```
## # A tibble: 308 x 2
##   word  lexicon 
##   <chr> <chr>   
## 1 de    snowball
## 2 la    snowball
## 3 que   snowball
## 4 el    snowball
## 5 en    snowball
## 6 y     snowball
## # … with 302 more rows
```

---

## Various lexicons

See `?get_stopwords` for more info.

.midi[
```r
get_stopwords(source = "smart")
```

```
## # A tibble: 571 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         smart  
## 2 a's       smart  
## 3 able      smart  
## 4 about     smart  
## 5 above     smart  
## 6 according smart  
## # … with 565 more rows
```
]

---

## Back to: Looking for commonalities

.small[
```r
listening %>%
  unnest_tokens(word, songs) %>%
* anti_join(stop_words) %>%
* filter(!(word %in% c("1", "2", "3", "4", "5"))) %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 640 x 2
##   word        n
##   <chr>   <int>
## 1 ed          7
## 2 queen       7
## 3 sheeran     7
## 4 love        6
## 5 bad         5
## 6 time        5
## # … with 634 more rows
```
]

---

## Top 20 common words in songs

.pull-left[
.small[
```r
top20_songs <- listening %>%
  unnest_tokens(word, songs) %>%
  anti_join(stop_words) %>%
  filter(
    !(word %in% c("1", "2", "3", "4", "5"))
    ) %>%
  count(word) %>%
  top_n(20)
```
]
]
.pull-right[
.midi[
```r
top20_songs %>%
  arrange(desc(n))
```

```
## # A tibble: 41 x 2
##   word        n
##   <chr>   <int>
## 1 ed          7
## 2 queen       7
## 3 sheeran     7
## 4 love        6
## 5 bad         5
## 6 time        5
## # … with 35 more rows
```
]
]

Note that `top_n(20)` keeps ties, which is why the result has 41 rows rather than 20.

---

## Visualizing commonalities: bar chart

.midi[
<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" />
]

---

... the code

```r
ggplot(top20_songs, aes(x = fct_reorder(word, n), y = n)) +
  geom_col() +
  labs(x = "Common words", y = "Count") +
  coord_flip()
```

---

## Visualizing commonalities: wordcloud

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" />

---

... and the code

```r
set.seed(1234)
wordcloud(words = top20_songs$word,
          freq = top20_songs$n,
          colors = brewer.pal(5, "Blues"),
          random.order = FALSE)
```

---

## Ok, so people like Ed Sheeran!

```r
str_subset(listening$songs, "Sheeran")
```

```
## [1] "Mess by Ed Sheeran, Take me back to london by Ed Sheeran and Sounds of the Skeng by Stormzy"
## [2] "Ed Sheeran- I don't care, beautiful people, don't"
## [3] "Truth Hurts by Lizzo , Wetsuit by The Vaccines , Beautiful People by Ed Sheeran"
## [4] "Sounds of the Skeng - Stormzy, Venom - Eminem, Take me back to london - Ed Sheeran, I see fire - Ed Sheeran"
```

---

## But I had to ask...

--

What is 1975?

--

```r
str_subset(listening$songs, "1975")
```

```
## [1] "Hate Me (Sometimes) - Stand Atlantic; Edge of Seventeen - Stevie Nicks; It's Not Living (If It's Not With You) - The 1975; People - The 1975; Hypersonic Missiles - Sam Fender"
## [2] "Chocolate by the 1975, sanctuary by Joji, A young understating by Sundara Karma"
## [3] "Lauv - I'm lonely, kwassa - good life, the 1975 - sincerity is scary"
```

---

class: middle

# Analyzing lyrics of one artist

---

## Let's get more data

We'll use the **genius** package to get song lyric data from [Genius](https://genius.com/).

- `genius_album()`: download lyrics for an entire album
- `add_genius()`: download lyrics for multiple albums

---

## Ed's most recent-ish albums

```r
artist_albums <- tribble(
  ~artist,      ~album,
  "Ed Sheeran", "No.6 Collaborations Project",
  "Ed Sheeran", "Divide",
  "Ed Sheeran", "Multiply",
  "Ed Sheeran", "Plus"
)

sheeran <- artist_albums %>%
  add_genius(artist, album, "album")
```

---

## Songs in the four albums
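.small[
One way to list them is an interactive table built with the **DT** package loaded earlier (a sketch; the exact columns displayed on the rendered slide are an assumption):

```r
# Hypothetical reconstruction: one row per album/track, as an interactive table
sheeran %>%
  distinct(album, track_title) %>%
  datatable()
```
]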
---

## How long are Ed Sheeran's songs?

Length measured by number of lines

```r
sheeran %>%
  count(track_title, sort = TRUE)
```

```
## # A tibble: 74 x 2
##   track_title                                            n
##   <chr>                                              <int>
## 1 Take It Back                                         101
## 2 The Man                                               96
## 3 Reuf by Nekfeu (Ft. Ed Sheeran)                        85
## 4 South of the Border (Ft. Camila Cabello & Cardi B)    77
## 5 Take Me Back to London (Ft. Stormzy)                  75
## 6 Make It Rain                                          64
## # … with 68 more rows
```

---

## Tidy up your lyrics!

```r
sheeran_lyrics <- sheeran %>%
  unnest_tokens(word, lyric)

sheeran_lyrics
```

```
## # A tibble: 11,578 x 6
##   artist   album           track_n  line track_title        word 
##   <chr>    <chr>             <int> <int> <chr>              <chr>
## 1 Ed Shee… No.6 Collabora…       1     1 Beautiful People … we   
## 2 Ed Shee… No.6 Collabora…       1     1 Beautiful People … are  
## 3 Ed Shee… No.6 Collabora…       1     1 Beautiful People … we   
## 4 Ed Shee… No.6 Collabora…       1     1 Beautiful People … are  
## 5 Ed Shee… No.6 Collabora…       1     1 Beautiful People … we   
## 6 Ed Shee… No.6 Collabora…       1     1 Beautiful People … are  
## # … with 11,572 more rows
```

---

## What are the most common words?

```r
sheeran_lyrics %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 1,928 x 2
##   word      n
##   <chr> <int>
## 1 i       368
## 2 the     367
## 3 you     351
## 4 and     330
## 5 my      249
## 6 a       212
## # … with 1,922 more rows
```

---

## What a romantic!

.midi[
```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 1,589 x 2
##   word      n
##   <chr> <int>
## 1 love    137
## 2 ye       75
## 3 <NA>     49
## 4 baby     44
## 5 rain     43
## 6 wanna    40
## # … with 1,583 more rows
```
]

---

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-26-1.png" width="80%" style="display: block; margin: auto;" />

---

... and the code

```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  count(word) %>%
  top_n(20) %>%
  ggplot(aes(fct_reorder(word, n), n)) +
  geom_col() +
  labs(title = "Frequency of Ed Sheeran's lyrics",
       subtitle = "`Love` tops the chart",
       y = "", x = "") +
  coord_flip()
```

---

class: middle

# Sentiment analysis

---

## Sentiment analysis

- One way to analyze the sentiment of a text is to consider the text as a combination of its individual words,
- and the sentiment content of the whole text as the sum of the sentiment content of the individual words.
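For example (a minimal sketch, assuming the AFINN lexicon introduced on the next slide), the score of one line is the sum of its words' values; words absent from the lexicon simply drop out of the join:

```r
# Score a single line: tokenize, attach per-word AFINN values, and sum them
tibble(text = "I never never want to go home") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  summarise(sentiment = sum(value))
```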
---

## Sentiment lexicons

.pull-left[
```r
get_sentiments("afinn")
```

```
## # A tibble: 2,477 x 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
## # … with 2,471 more rows
```
]
.pull-right[
```r
get_sentiments("bing")
```

```
## # A tibble: 6,786 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative 
## # … with 6,780 more rows
```
]

---

## Sentiment lexicons

.pull-left[
```r
get_sentiments("nrc")
```

```
## # A tibble: 13,901 x 2
##   word      sentiment
##   <chr>     <chr>    
## 1 abacus    trust    
## 2 abandon   fear     
## 3 abandon   negative 
## 4 abandon   sadness  
## 5 abandoned anger    
## 6 abandoned fear     
## # … with 13,895 more rows
```
]
.pull-right[
```r
get_sentiments("loughran")
```

```
## # A tibble: 4,150 x 2
##   word         sentiment
##   <chr>        <chr>    
## 1 abandon      negative 
## 2 abandoned    negative 
## 3 abandoning   negative 
## 4 abandonment  negative 
## 5 abandonments negative 
## 6 abandons     negative 
## # … with 4,144 more rows
```
]

---

class: middle

## Categorizing sentiments

---

## Sentiments in Sheeran's lyrics

.midi[
```r
sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE)
```

```
## # A tibble: 201 x 3
##   sentiment word        n
##   <chr>     <chr>   <int>
## 1 positive  love      137
## 2 positive  like       67
## 3 positive  right      17
## 4 positive  well       16
## 5 negative  falling    14
## 6 positive  loved      14
## # … with 195 more rows
```
]

---

class: middle

**Goal:** Find the top 10 most common words with positive and negative sentiments.

---

### Step 1: Top 10 words for each sentiment

.midi[
```r
sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  group_by(sentiment) %>%
  top_n(10)
```

```
## # A tibble: 22 x 3
## # Groups:   sentiment [2]
##   sentiment word        n
##   <chr>     <chr>   <int>
## 1 negative  break      11
## 2 negative  cold       11
## 3 negative  cry         6
## 4 negative  drunk      10
## 5 negative  fall        6
## 6 negative  falling    14
## # … with 16 more rows
```
]

---

### Step 2: `ungroup()`

.midi[
```r
sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup()
```

```
## # A tibble: 22 x 3
##   sentiment word        n
##   <chr>     <chr>   <int>
## 1 negative  break      11
## 2 negative  cold       11
## 3 negative  cry         6
## 4 negative  drunk      10
## 5 negative  fall        6
## 6 negative  falling    14
## # … with 16 more rows
```
]

---

### Step 3: Save the result

```r
sheeran_top10 <- sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup()
```

---

class: middle

**Goal:** Visualize the top 10 most common words with positive and negative sentiments.

---

### Step 1: Create a bar chart

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  geom_col()
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-36-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 2: Order bars by frequency

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col()
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-37-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 3: Facet by sentiment

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment)
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-38-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 4: Free the scales!

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free")
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-39-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 5: Flip the coordinates

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip()
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-40-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 6: Clean up labels

.small[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Sentiments in Ed Sheeran's lyrics", x = "", y = "")
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-41-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 7: Remove redundant info

.small[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Sentiments in Ed Sheeran's lyrics", x = "", y = "") +
  guides(fill = "none")
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-42-1.png" width="80%" style="display: block; margin: auto;" />
]

---

class: middle

## Scoring sentiments

---

## Assign a sentiment score

.small[
```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  left_join(get_sentiments("afinn"))
```

```
## # A tibble: 4,072 x 7
##   artist   album       track_n  line track_title     word   value
##   <chr>    <chr>         <int> <int> <chr>           <chr>  <dbl>
## 1 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… l.a       NA
## 2 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… satur…    NA
## 3 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… night     NA
## 4 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… summer    NA
## 5 Ed Shee… No.6 Colla…       1     3 Beautiful Peop… sundo…    NA
## 6 Ed Shee… No.6 Colla…       1     4 Beautiful Peop… lambo…    NA
## # … with 4,066 more rows
```
]

---

```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  left_join(get_sentiments("afinn")) %>%
  filter(!is.na(value)) %>%
  group_by(album) %>%
  summarise(total_sentiment = sum(value)) %>%
  arrange(total_sentiment)
```

```
## # A tibble: 4 x 2
##   album                       total_sentiment
##   <chr>                                 <dbl>
## 1 Plus                                     94
## 2 Divide                                   95
## 3 Multiply                                114
## 4 No.6 Collaborations Project             134
```

---

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" />
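---

... and a plausible way to make it (the plotting code is not shown in the original; this sketch assumes a flipped bar chart of the album totals computed on the previous slide):

```r
# Hypothetical reconstruction of the album-level sentiment plot
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  left_join(get_sentiments("afinn")) %>%
  filter(!is.na(value)) %>%
  group_by(album) %>%
  summarise(total_sentiment = sum(value)) %>%
  ggplot(aes(x = fct_reorder(album, total_sentiment), y = total_sentiment)) +
  geom_col() +
  labs(x = "Album", y = "Total sentiment score") +
  coord_flip()
```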
---

## Acknowledgements

- Julia Silge: https://github.com/juliasilge/tidytext-tutorial
- Julia Silge and David Robinson: https://www.tidytextmining.com/
- Josiah Parry: https://github.com/JosiahParry/genius