class: center, middle, inverse, title-slide

# Text analysis 📃
---

layout: true

<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle

# Tidytext analysis

---

## Packages

In addition to `tidyverse`, we will be using four other packages today:

```r
library(tidytext)
library(genius)
library(wordcloud)
library(DT)
```

---

## Tidytext

- Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use.
- Learn more at https://www.tidytextmining.com/.

---

## What is tidy text?

```r
text <- c("Take me out tonight",
          "Where there's music and there's people",
          "And they're young and alive",
          "Driving in your car",
          "I never never want to go home",
          "Because I haven't got one",
          "Anymore")

text
```

```
## [1] "Take me out tonight"                   
## [2] "Where there's music and there's people"
## [3] "And they're young and alive"           
## [4] "Driving in your car"                   
## [5] "I never never want to go home"         
## [6] "Because I haven't got one"             
## [7] "Anymore"
```

---

## What is tidy text?

```r
text_df <- tibble(line = 1:7, text = text)

text_df
```

```
## # A tibble: 7 x 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Take me out tonight                   
## 2     2 Where there's music and there's people
## 3     3 And they're young and alive           
## 4     4 Driving in your car                   
## 5     5 I never never want to go home         
## 6     6 Because I haven't got one             
## # … with 1 more row
```

---

## What is tidy text?

```r
text_df %>%
  unnest_tokens(word, text)
```

```
## # A tibble: 32 x 2
##    line word   
##   <int> <chr>  
## 1     1 take   
## 2     1 me     
## 3     1 out    
## 4     1 tonight
## 5     2 where  
## 6     2 there's
## # … with 26 more rows
```

---

class: middle

# What are you listening to?

---

## From the "Getting to know you" survey

> "What are your 3 - 5 most favorite songs right now?"

.midi[
```r
listening <- read_csv("data/listening.csv")

listening
```

```
## # A tibble: 104 x 1
##   songs                                                          
##   <chr>                                                          
## 1 Gamma Knife - King Gizzard and the Lizard Wizard; Self Immolat…
## 2 I dont listen to much music                                    
## 3 Mess by Ed Sheeran, Take me back to london by Ed Sheeran and S…
## 4 Hate Me (Sometimes) - Stand Atlantic; Edge of Seventeen - Stev…
## 5 whistle, gogobebe, sassy me                                    
## 6 Shofukan, Think twice, Padiddle                                
## # … with 98 more rows
```
]

---

## Looking for commonalities

.midi[
```r
listening %>%
  unnest_tokens(word, songs) %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 786 x 2
##   word      n
##   <chr> <int>
## 1 the      56
## 2 by       23
## 3 to       20
## 4 and      19
## 5 i        19
## 6 you      15
## # … with 780 more rows
```
]

---

## Stop words

- In computing, stop words are words that are filtered out before or after processing natural language data (text).
- They usually refer to the most common words in a language, but there is no single list of stop words shared by all natural language processing tools.
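Since stop word lists are regular data frames, one way to drop them is with `anti_join()` (a minimal sketch using the `text_df` from earlier, anticipating the worked example later in the deck):

```r
# Keep only the tokens that do not appear in the default (snowball) stop word list
text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word")
```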
---

## English stop words

```r
get_stopwords()
```

```
## # A tibble: 175 x 2
##   word   lexicon 
##   <chr>  <chr>   
## 1 i      snowball
## 2 me     snowball
## 3 my     snowball
## 4 myself snowball
## 5 we     snowball
## 6 our    snowball
## # … with 169 more rows
```

---

## Spanish stop words

```r
get_stopwords(language = "es")
```

```
## # A tibble: 308 x 2
##   word  lexicon 
##   <chr> <chr>   
## 1 de    snowball
## 2 la    snowball
## 3 que   snowball
## 4 el    snowball
## 5 en    snowball
## 6 y     snowball
## # … with 302 more rows
```

---

## Various lexicons

See `?get_stopwords` for more info.

.midi[
```r
get_stopwords(source = "smart")
```

```
## # A tibble: 571 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         smart  
## 2 a's       smart  
## 3 able      smart  
## 4 about     smart  
## 5 above     smart  
## 6 according smart  
## # … with 565 more rows
```
]

---

## Back to: Looking for commonalities

.small[
```r
listening %>%
  unnest_tokens(word, songs) %>%
* anti_join(stop_words) %>%
* filter(!(word %in% c("1", "2", "3", "4", "5"))) %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 640 x 2
##   word        n
##   <chr>   <int>
## 1 ed          7
## 2 queen       7
## 3 sheeran     7
## 4 love        6
## 5 bad         5
## 6 time        5
## # … with 634 more rows
```
]

---

## Top 20 common words in songs

.pull-left[
.small[
```r
top20_songs <- listening %>%
  unnest_tokens(word, songs) %>%
  anti_join(stop_words) %>%
  filter(
    !(word %in% c("1", "2", "3", "4", "5"))
    ) %>%
  count(word) %>%
  top_n(20)
```
]
]
.pull-right[
.midi[
```r
top20_songs %>%
  arrange(desc(n))
```

```
## # A tibble: 41 x 2
##   word        n
##   <chr>   <int>
## 1 ed          7
## 2 queen       7
## 3 sheeran     7
## 4 love        6
## 5 bad         5
## 6 time        5
## # … with 35 more rows
```
]
]

Note that `top_n(20)` keeps ties, which is why the result has 41 rows rather than 20.

---

## Visualizing commonalities: bar chart

.midi[
<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" />
]

---

... the code

```r
ggplot(top20_songs, aes(x = fct_reorder(word, n), y = n)) +
  geom_col() +
  labs(x = "Common words", y = "Count") +
  coord_flip()
```

---

## Visualizing commonalities: wordcloud

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" />

---

... and the code

```r
set.seed(1234)
wordcloud(words = top20_songs$word,
          freq = top20_songs$n,
          colors = brewer.pal(5, "Blues"),
          random.order = FALSE)
```

---

## Ok, so people like Ed Sheeran!

```r
str_subset(listening$songs, "Sheeran")
```

```
## [1] "Mess by Ed Sheeran, Take me back to london by Ed Sheeran and Sounds of the Skeng by Stormzy"
## [2] "Ed Sheeran- I don't care, beautiful people, don't"
## [3] "Truth Hurts by Lizzo , Wetsuit by The Vaccines , Beautiful People by Ed Sheeran"
## [4] "Sounds of the Skeng - Stormzy, Venom - Eminem, Take me back to london - Ed Sheeran, I see fire - Ed Sheeran"
```

---

## But I had to ask...

--

What is 1975?

--

```r
str_subset(listening$songs, "1975")
```

```
## [1] "Hate Me (Sometimes) - Stand Atlantic; Edge of Seventeen - Stevie Nicks; It's Not Living (If It's Not With You) - The 1975; People - The 1975; Hypersonic Missiles - Sam Fender"
## [2] "Chocolate by the 1975, sanctuary by Joji, A young understating by Sundara Karma"
## [3] "Lauv - I'm lonely, kwassa - good life, the 1975 - sincerity is scary"
```

---

class: middle

# Analyzing lyrics of one artist

---

## Let's get more data

We'll use the **genius** package to get song lyric data from [Genius](https://genius.com/).

- `genius_album()`: download lyrics for an entire album
- `add_genius()`: download lyrics for multiple albums

---

## Ed's most recent-ish albums

```r
artist_albums <- tribble(
  ~artist,      ~album,
  "Ed Sheeran", "No.6 Collaborations Project",
  "Ed Sheeran", "Divide",
  "Ed Sheeran", "Multiply",
  "Ed Sheeran", "Plus"
)

sheeran <- artist_albums %>%
  add_genius(artist, album, "album")
```

---

## Songs in the four albums
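.small[
One way to list them is an interactive table built with the **DT** package loaded earlier (a sketch; the exact columns displayed on the rendered slide are an assumption):

```r
# Hypothetical reconstruction: one row per album/track, as an interactive table
sheeran %>%
  distinct(album, track_title) %>%
  datatable()
```
]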
---

## How long are Ed Sheeran's songs?

Length measured by number of lines

```r
sheeran %>%
  count(track_title, sort = TRUE)
```

```
## # A tibble: 74 x 2
##   track_title                                            n
##   <chr>                                              <int>
## 1 Take It Back                                         101
## 2 The Man                                               96
## 3 Reuf by Nekfeu (Ft. Ed Sheeran)                        85
## 4 South of the Border (Ft. Camila Cabello & Cardi B)    77
## 5 Take Me Back to London (Ft. Stormzy)                  75
## 6 Make It Rain                                          64
## # … with 68 more rows
```

---

## Tidy up your lyrics!

```r
sheeran_lyrics <- sheeran %>%
  unnest_tokens(word, lyric)

sheeran_lyrics
```

```
## # A tibble: 11,578 x 6
##   artist   album           track_n  line track_title        word 
##   <chr>    <chr>             <int> <int> <chr>              <chr>
## 1 Ed Shee… No.6 Collabora…       1     1 Beautiful People … we   
## 2 Ed Shee… No.6 Collabora…       1     1 Beautiful People … are  
## 3 Ed Shee… No.6 Collabora…       1     1 Beautiful People … we   
## 4 Ed Shee… No.6 Collabora…       1     1 Beautiful People … are  
## 5 Ed Shee… No.6 Collabora…       1     1 Beautiful People … we   
## 6 Ed Shee… No.6 Collabora…       1     1 Beautiful People … are  
## # … with 11,572 more rows
```

---

## What are the most common words?

```r
sheeran_lyrics %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 1,928 x 2
##   word      n
##   <chr> <int>
## 1 i       368
## 2 the     367
## 3 you     351
## 4 and     330
## 5 my      249
## 6 a       212
## # … with 1,922 more rows
```

---

## What a romantic!

.midi[
```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 1,589 x 2
##   word      n
##   <chr> <int>
## 1 love    137
## 2 ye       75
## 3 <NA>     49
## 4 baby     44
## 5 rain     43
## 6 wanna    40
## # … with 1,583 more rows
```
]

---

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-26-1.png" width="80%" style="display: block; margin: auto;" />

---

... and the code

```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  count(word) %>%
  top_n(20) %>%
  ggplot(aes(fct_reorder(word, n), n)) +
  geom_col() +
  labs(title = "Frequency of Ed Sheeran's lyrics",
       subtitle = "`Love` tops the chart",
       y = "", x = "") +
  coord_flip()
```

---

class: middle

# Sentiment analysis

---

## Sentiment analysis

- One way to analyze the sentiment of a text is to consider the text as a combination of its individual words,
- and the sentiment content of the whole text as the sum of the sentiment content of the individual words.
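For example (a minimal sketch, assuming the AFINN lexicon introduced on the next slide), the score of one line is the sum of its words' values; words absent from the lexicon simply drop out of the join:

```r
# Score a single line: tokenize, attach per-word AFINN values, and sum them
tibble(text = "I never never want to go home") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  summarise(sentiment = sum(value))
```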
---

## Sentiment lexicons

.pull-left[
```r
get_sentiments("afinn")
```

```
## # A tibble: 2,477 x 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
## # … with 2,471 more rows
```
]
.pull-right[
```r
get_sentiments("bing")
```

```
## # A tibble: 6,786 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative 
## # … with 6,780 more rows
```
]

---

## Sentiment lexicons

.pull-left[
```r
get_sentiments("nrc")
```

```
## # A tibble: 13,901 x 2
##   word      sentiment
##   <chr>     <chr>    
## 1 abacus    trust    
## 2 abandon   fear     
## 3 abandon   negative 
## 4 abandon   sadness  
## 5 abandoned anger    
## 6 abandoned fear     
## # … with 13,895 more rows
```
]
.pull-right[
```r
get_sentiments("loughran")
```

```
## # A tibble: 4,150 x 2
##   word         sentiment
##   <chr>        <chr>    
## 1 abandon      negative 
## 2 abandoned    negative 
## 3 abandoning   negative 
## 4 abandonment  negative 
## 5 abandonments negative 
## 6 abandons     negative 
## # … with 4,144 more rows
```
]

---

class: middle

## Categorizing sentiments

---

## Sentiments in Sheeran's lyrics

.midi[
```r
sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE)
```

```
## # A tibble: 201 x 3
##   sentiment word        n
##   <chr>     <chr>   <int>
## 1 positive  love      137
## 2 positive  like       67
## 3 positive  right      17
## 4 positive  well       16
## 5 negative  falling    14
## 6 positive  loved      14
## # … with 195 more rows
```
]

---

class: middle

**Goal:** Find the top 10 most common words with positive and negative sentiments.

---

### Step 1: Top 10 words for each sentiment

.midi[
```r
sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  group_by(sentiment) %>%
  top_n(10)
```

```
## # A tibble: 22 x 3
## # Groups:   sentiment [2]
##   sentiment word        n
##   <chr>     <chr>   <int>
## 1 negative  break      11
## 2 negative  cold       11
## 3 negative  cry         6
## 4 negative  drunk      10
## 5 negative  fall        6
## 6 negative  falling    14
## # … with 16 more rows
```
]

---

### Step 2: `ungroup()`

.midi[
```r
sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup()
```

```
## # A tibble: 22 x 3
##   sentiment word        n
##   <chr>     <chr>   <int>
## 1 negative  break      11
## 2 negative  cold       11
## 3 negative  cry         6
## 4 negative  drunk      10
## 5 negative  fall        6
## 6 negative  falling    14
## # … with 16 more rows
```
]

---

### Step 3: Save the result

```r
sheeran_top10 <- sheeran_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup()
```

---

class: middle

**Goal:** Visualize the top 10 most common words with positive and negative sentiments.

---

### Step 1: Create a bar chart

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  geom_col()
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-36-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 2: Order bars by frequency

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col()
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-37-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 3: Facet by sentiment

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment)
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-38-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 4: Free the scales!

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free")
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-39-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 5: Flip the coordinates

.midi[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip()
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-40-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 6: Clean up labels

.small[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Sentiments in Ed Sheeran's lyrics", x = "", y = "")
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-41-1.png" width="80%" style="display: block; margin: auto;" />
]

---

### Step 7: Remove redundant info

.small[
```r
sheeran_top10 %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Sentiments in Ed Sheeran's lyrics", x = "", y = "") +
  guides(fill = "none")
```

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-42-1.png" width="80%" style="display: block; margin: auto;" />
]

---

class: middle

## Scoring sentiments

---

## Assign a sentiment score

.small[
```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  left_join(get_sentiments("afinn"))
```

```
## # A tibble: 4,072 x 7
##   artist   album       track_n  line track_title     word   value
##   <chr>    <chr>         <int> <int> <chr>           <chr>  <dbl>
## 1 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… l.a       NA
## 2 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… satur…    NA
## 3 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… night     NA
## 4 Ed Shee… No.6 Colla…       1     2 Beautiful Peop… summer    NA
## 5 Ed Shee… No.6 Colla…       1     3 Beautiful Peop… sundo…    NA
## 6 Ed Shee… No.6 Colla…       1     4 Beautiful Peop… lambo…    NA
## # … with 4,066 more rows
```
]

---

```r
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  left_join(get_sentiments("afinn")) %>%
  filter(!is.na(value)) %>%
  group_by(album) %>%
  summarise(total_sentiment = sum(value)) %>%
  arrange(total_sentiment)
```

```
## # A tibble: 4 x 2
##   album                       total_sentiment
##   <chr>                                 <dbl>
## 1 Plus                                     94
## 2 Divide                                   95
## 3 Multiply                                114
## 4 No.6 Collaborations Project             134
```

---

<img src="u2-d11-text-analysis_files/figure-html/unnamed-chunk-45-1.png" width="60%" style="display: block; margin: auto;" />
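---

... and a plausible way to make it (the plotting code is not shown in the original; this sketch assumes a flipped bar chart of the album totals computed on the previous slide):

```r
# Hypothetical reconstruction of the album-level sentiment plot
sheeran_lyrics %>%
  anti_join(stop_words) %>%
  left_join(get_sentiments("afinn")) %>%
  filter(!is.na(value)) %>%
  group_by(album) %>%
  summarise(total_sentiment = sum(value)) %>%
  ggplot(aes(x = fct_reorder(album, total_sentiment), y = total_sentiment)) +
  geom_col() +
  labs(x = "Album", y = "Total sentiment score") +
  coord_flip()
```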
---

## Acknowledgements

- Julia Silge: https://github.com/juliasilge/tidytext-tutorial
- Julia Silge and David Robinson: https://www.tidytextmining.com/
- Josiah Parry: https://github.com/JosiahParry/genius