Working with regular expressions

class: center, middle, inverse, title-slide

.title[
# Working with regular expressions
]
.subtitle[
## <br><br> College of the Atlantic
]
.author[
### Laurie Baker
]

---

class: middle

# Regular Expressions

---
## Regular Expressions

Pattern matching functions use patterns, otherwise known as "regular expressions" or "regex", to identify specific characteristics in strings.

- Reference material: [R4DS Regular Expressions](https://r4ds.hadley.nz/regexps)

---
## Pattern basics

- `str_view()` shows the elements of a string vector that match a regular expression, wrapping each match in `<>`.

```r
library(stringr)
library(dplyr)     # for filter(), count(), mutate(), and %>%
library(babynames)
```

The simplest patterns consist of letters and numbers which match those characters:

```r
str_view(string = fruit, pattern = "berry")
```

```
##  [6] │ bil<berry>
##  [7] │ black<berry>
## [10] │ blue<berry>
## [11] │ boysen<berry>
## [19] │ cloud<berry>
## [21] │ cran<berry>
## [29] │ elder<berry>
## [32] │ goji <berry>
## [33] │ goose<berry>
## [38] │ huckle<berry>
## [50] │ mul<berry>
## [70] │ rasp<berry>
## [73] │ salal <berry>
## [76] │ straw<berry>
```

Note that the first argument of `str_view()` is the `string` and the second is the `pattern`.

---
## Literal characters and meta characters

- **Literal characters**: letters and numbers, which match themselves.
- **Meta characters**: most punctuation characters, such as `.`, `+`, `*`, `[`, `]`, and `?`, which have special meanings.
    - For example, `.` matches any character (except a newline), so `"a."` matches any string that contains an `"a"` followed by another character:

```r
str_view(c("apple", "orange", "pomegranate"), pattern = "a.")
```

```
## [1] │ <ap>ple
## [2] │ or<an>ge
## [3] │ pomegr<an><at>e
```

We could find names that include an `"a"` followed by any three characters, followed by an `"e"`:

```r
baby1880 <- babynames %>%
  filter(year == 1880)

str_view(baby1880$name, pattern = "a...e")
```

```
##  [20] │ C<arrie>
##  [27] │ H<attie>
##  [29] │ M<attie>
##  [46] │ M<aggie>
##  [49] │ F<annie>
##  [59] │ Bl<anche>
##  [64] │ S<allie>
##  [73] │ H<arrie>t
##  [87] │ N<annie>
## [124] │ C<allie>
## [156] │ H<arrie>tt
## [172] │ Kath<arine>
## [214] │ H<allie>
## [224] │ M<argue>rite
## [271] │ C<assie>
## [279] │ M<argie>
## [296] │ Cath<arine>
## [332] │ M<argre>t
## [333] │ Ad<aline>
## [350] │ P<attie>
## ... and 47 more
```

---
## Quantifiers

**Quantifiers** control how many times a pattern can match.

- `?` makes a pattern optional (i.e. it matches 0 or 1 times)
- `+` lets a pattern repeat (i.e. matches at least once)
- `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0)

```r
str_view(c("a", "ab", "abb"), pattern = "ab?")
```

```
## [1] │ <a>
## [2] │ <ab>
## [3] │ <ab>b
```

```r
str_view(c("a", "ab", "abb"), pattern = "ab+")
```

```
## [2] │ <ab>
## [3] │ <abb>
```

```r
str_view(c("a", "ab", "abb"), pattern = "ab*")
```

```
## [1] │ <a>
## [2] │ <ab>
## [3] │ <abb>
```

- **Your Turn**: With your partner, write a short description of what each pattern matches.

---
## Character Classes

**Character classes** are defined by `[]` and let you match a set of characters.

- e.g. `[lmno]` matches "l", "m", "n", or "o".
- Starting the class with `^` inverts the match, so `[^lmno]` matches anything **except** "l", "m", "n", or "o".

```r
str_view(words, "[aeiou]x[aeiou]")
```

```
## [284] │ <exa>ct
## [285] │ <exa>mple
## [288] │ <exe>rcise
## [289] │ <exi>st
```

```r
str_view(words, "[^aeiou]y[^aeiou]")
```

```
## [836] │ <sys>tem
## [901] │ <typ>e
```

---
## Your Turn

Work on Exercises 1 and 2.

---
## Alternation

**Alternation**, `|`, picks between one or more alternative patterns.

- Fruits with "nut", "berry", or "apple".

```r
str_view(fruit, "apple|nut|berry")
```

```
##  [1] │ <apple>
##  [6] │ bil<berry>
##  [7] │ black<berry>
## [10] │ blue<berry>
## [11] │ boysen<berry>
## [19] │ cloud<berry>
## [20] │ coco<nut>
## [21] │ cran<berry>
## [29] │ elder<berry>
## [32] │ goji <berry>
## [33] │ goose<berry>
## [38] │ huckle<berry>
## [50] │ mul<berry>
## [52] │ <nut>
## [62] │ pine<apple>
## [70] │ rasp<berry>
## [73] │ salal <berry>
## [76] │ straw<berry>
```

- Names with a repeating vowel

```r
str_view(baby1880$name, pattern = "aa|ee|ii|oo|uu")
```

```
##  [300] │ L<ee>
##  [358] │ Kathl<ee>n
##  [416] │ Qu<ee>n
##  [466] │ Aim<ee>
##  [539] │ Qu<ee>nie
##  [725] │ Paral<ee>
##  [739] │ Ail<ee>n
##  [768] │ D<ee>
##  [923] │ Rosal<ee>
##  [932] │ Tenness<ee>
##  [985] │ L<ee>
## [1004] │ Is<aa>c
## [1213] │ Gr<ee>n
## [1279] │ Fr<ee>man
## [1298] │ Elw<oo>d
## [1326] │ D<ee>
## [1466] │ Wh<ee>ler
## [1559] │ Hayw<oo>d
## [1685] │ W<oo>dson
## [1705] │ Cr<ee>d
## ... and 10 more
```

---
## Your Turn

Work on Exercises 3 and 4.

---
## Key stringr functions

- `str_detect()`: detect the presence or absence of a match
- `str_count()`: count how many matches are in each string
- `str_extract()`: extract the matches
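
As a quick preview, here is a minimal sketch that applies all three functions to the same pattern. The vector `snacks` is a made-up example and not part of the course data.

```r
library(stringr)

# Hypothetical example vector (not from the course data)
snacks <- c("banana", "kiwi", "plum")

str_detect(snacks, "an")   # is there a match in each string?
str_count(snacks, "an")    # how many matches are in each string?
str_extract(snacks, "an")  # the first match in each string, or NA if none
```

Each function returns one result per element of the input vector, which is why they slot easily into `filter()` and `mutate()`.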

---
## Detect matches

`str_detect()` returns a logical vector that is `TRUE` if the pattern matches an element of the character vector and `FALSE` otherwise.

```r
str_detect(c("a", "b", "c"), "[aeiou]")
```

```
## [1]  TRUE FALSE FALSE
```

`str_detect()` works well with `filter()`:

```r
baby1880 |>
  filter(str_detect(name, pattern = "x")) |>
  count(name, wt = n, sort = TRUE)
```

```
## # A tibble: 13 × 2
##   name          n
##   <chr>     <int>
## 1 Alexander   211
## 2 Alex        147
## 3 Felix        92
## 4 Roxie        62
## 5 Max          52
## 6 Axel         16
## # ℹ 7 more rows
```

---
## Count matches

`str_count()` can be used to count the number of times a pattern occurs in a string.

```r
str_count("ababbaba", pattern = "ab")
```

```
## [1] 3
```

```r
str_view("ababbaba", pattern = "ab")
```

```
## [1] │ <ab><ab>b<ab>a
```

`str_count()` can be easily paired with `mutate()`

```r
# number of vowels and consonants for babynames with an x

babynames |>
  filter(str_detect(name, pattern = "x")) |>
  count(name) |>
  mutate(vowels = str_count(name, "[aeiou]"),
         consonants = str_count(name, "[^aeiou]")
         ) |>
  arrange(vowels, consonants, n)
```

```
## # A tibble: 974 × 4
##   name      n vowels consonants
##   <chr> <int>  <int>      <int>
## 1 Nyx       9      0          3
## 2 Axl      30      0          3
## 3 Axyl      9      0          4
## 4 Lynx      9      0          4
## 5 Eryx     11      0          4
## 6 Alyx     57      0          4
## # ℹ 968 more rows
```

There's something off with our calculations. What is it?

---
## Regular expressions are case sensitive

Options:

- Add uppercase vowels: `str_count(name, pattern = "[aeiouAEIOU]")`
- Tell the regular expression to ignore case: `str_count(name, regex("[aeiou]", ignore_case = TRUE))`
- Use `str_to_lower()` to convert the names to lower case: `str_count(str_to_lower(name), "[aeiou]")`
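
A minimal sketch to check that the three options agree. The `name` vector here is a small made-up sample (chosen so each name starts with an uppercase vowel), not the `babynames` data.

```r
library(stringr)

# Hypothetical sample of names that start with an uppercase vowel
name <- c("Alex", "Eryx", "Ulla")

opt1 <- str_count(name, pattern = "[aeiouAEIOU]")
opt2 <- str_count(name, regex("[aeiou]", ignore_case = TRUE))
opt3 <- str_count(str_to_lower(name), "[aeiou]")

# TRUE if all three approaches give the same counts
identical(opt1, opt2) && identical(opt2, opt3)
```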

---
## Replace values

- We can also modify matches with `str_replace()` and `str_replace_all()`: `str_replace()` replaces the first match and `str_replace_all()` replaces all matches.

```r
babynames |>
  count(name) |>
  filter(str_detect(name, pattern = "[cC]")) |>
  mutate(new_name = str_replace_all(name, "[cC]", "cup"))
```

```
## # A tibble: 12,917 × 3
##   name        n new_name 
##   <chr>   <int> <chr>    
## 1 Aalicia     1 Aalicupia
## 2 Aalycia     4 Aalycupia
## 3 Aanchal    10 Aancuphal
## 4 Aaric      33 Aaricup  
## 5 Aarica     13 Aaricupa 
## 6 Aarick      5 Aaricupk 
## # ℹ 12,911 more rows
```

---
## Replace values

```r
str_replace(fruit, pattern = "a", replacement = "e") # only the first instance
```

```
##  [1] "epple"             "epricot"           "evocado"          
##  [4] "benana"            "bell pepper"       "bilberry"         
##  [7] "bleckberry"        "bleckcurrant"      "blood orenge"     
## [10] "blueberry"         "boysenberry"       "breedfruit"       
## [13] "cenary melon"      "centaloupe"        "cherimoye"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberry"        "coconut"           "crenberry"        
## [22] "cucumber"          "current"           "demson"           
## [25] "dete"              "dregonfruit"       "durien"           
## [28] "eggplent"          "elderberry"        "feijoe"           
## [31] "fig"               "goji berry"        "gooseberry"       
## [34] "grepe"             "grepefruit"        "gueva"            
## [37] "honeydew"          "huckleberry"       "jeckfruit"        
## [40] "jembul"            "jujube"            "kiwi fruit"       
## [43] "kumquet"           "lemon"             "lime"             
## [46] "loquet"            "lychee"            "mendarine"        
## [49] "mengo"             "mulberry"          "necterine"        
## [52] "nut"               "olive"             "orenge"           
## [55] "pemelo"            "pepaya"            "pessionfruit"     
## [58] "peech"             "peer"              "persimmon"        
## [61] "physelis"          "pineepple"         "plum"             
## [64] "pomegrenate"       "pomelo"            "purple mengosteen"
## [67] "quince"            "reisin"            "rembutan"         
## [70] "respberry"         "redcurrent"        "rock melon"       
## [73] "selal berry"       "setsuma"           "ster fruit"       
## [76] "strewberry"        "temarillo"         "tengerine"        
## [79] "ugli fruit"        "wetermelon"
```

```r
str_replace_all(fruit, pattern = "a", replacement = "e") # every instance
```

```
##  [1] "epple"             "epricot"           "evocedo"          
##  [4] "benene"            "bell pepper"       "bilberry"         
##  [7] "bleckberry"        "bleckcurrent"      "blood orenge"     
## [10] "blueberry"         "boysenberry"       "breedfruit"       
## [13] "cenery melon"      "centeloupe"        "cherimoye"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberry"        "coconut"           "crenberry"        
## [22] "cucumber"          "current"           "demson"           
## [25] "dete"              "dregonfruit"       "durien"           
## [28] "eggplent"          "elderberry"        "feijoe"           
## [31] "fig"               "goji berry"        "gooseberry"       
## [34] "grepe"             "grepefruit"        "gueve"            
## [37] "honeydew"          "huckleberry"       "jeckfruit"        
## [40] "jembul"            "jujube"            "kiwi fruit"       
## [43] "kumquet"           "lemon"             "lime"             
## [46] "loquet"            "lychee"            "menderine"        
## [49] "mengo"             "mulberry"          "necterine"        
## [52] "nut"               "olive"             "orenge"           
## [55] "pemelo"            "pepeye"            "pessionfruit"     
## [58] "peech"             "peer"              "persimmon"        
## [61] "physelis"          "pineepple"         "plum"             
## [64] "pomegrenete"       "pomelo"            "purple mengosteen"
## [67] "quince"            "reisin"            "rembuten"         
## [70] "respberry"         "redcurrent"        "rock melon"       
## [73] "selel berry"       "setsume"           "ster fruit"       
## [76] "strewberry"        "temerillo"         "tengerine"        
## [79] "ugli fruit"        "wetermelon"
```

---
## Your Turn

Work on Exercises 5 and 6.
    
---
## Extracting Patterns

- You can extract pattern matches using `str_extract()`

```r
lab_fees <- c("100 dollars", "10$", "1500 USD")

# + requires at least one digit; * could also match an empty string
str_extract(lab_fees, pattern = "[0-9]+")
```

```
## [1] "100"  "10"   "1500"
```

---
## Extract phone numbers case study

#### **Maine phone numbers:**

- Maine phone numbers begin with 207 and are followed by 7 more digits.
- Written with hyphens, they match the regular expression `"^207-[0-9]{3}-[0-9]{4}"`.

```r
phone_numbers <- c("207-846-3630", "207-865-3823", "603-712-5043", "0783-792-9863", "+4478438971066", "207-865-7916")

str_extract(phone_numbers, pattern = "^207-[0-9]{3}-[0-9]{4}")
```

```
## [1] "207-846-3630" "207-865-3823" NA             NA            
## [5] NA             "207-865-7916"
```

N.B. `{3}` repeats the preceding element of the pattern exactly three times.

---
## Pattern details

- **Escaping**: how to match metacharacters literally.
- **Anchors** match the start or end of a string.
- More on **character classes** and their shortcuts.
- More on **quantifiers** which control how many times a pattern can match.

---
## Escaping

- Regular expressions use the backslash for escaping metacharacters.
- To match a literal `.`, we need the regexp `\.`.
- Because we use strings to represent the regular expression, and `\` is also used as an escape symbol in strings, to create the regular expression `\.` we need the string `"\\."`

```r
str_view(c("abc", "a.c", "bef"), "a\\.c")
```

```
## [2] │ <a.c>
```
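
To see the two layers of escaping, it can help to print the string with `writeLines()`, which shows the characters the regular expression engine actually receives. This is a small sketch; the variable name `dot` is just for illustration.

```r
library(stringr)

dot <- "\\."      # the R string that holds the regular expression \.
writeLines(dot)   # prints the string contents as the regex engine sees them

# Unescaped, . matches any character; escaped, it matches only a literal dot
str_detect(c("abc", "a.c"), "a.c")
str_detect(c("abc", "a.c"), "a\\.c")
```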

---
## Anchors

By default, regular expressions will match any part of a string. To match the start or end, you need to **anchor** the regular expression:

- `^` matches the start.
- `$` matches the end.

```r
str_view(baby1880$name, "^G")
```

```
##  [19] │ <G>race
##  [25] │ <G>ertrude
##  [84] │ <G>eorgia
## [202] │ <G>enevieve
## [209] │ <G>ussie
## [237] │ <G>eorgie
## [256] │ <G>eorgiana
## [266] │ <G>ertie
## [298] │ <G>eneva
## [312] │ <G>eorge
## [330] │ <G>oldie
## [370] │ <G>ladys
## [393] │ <G>eorgianna
## [394] │ <G>racie
## [529] │ <G>ena
## [700] │ <G>eraldine
## [701] │ <G>ina
## [702] │ <G>lenna
## [703] │ <G>rayce
## [784] │ <G>olda
## ... and 45 more
```

```r
str_view(baby1880$name, "g$")
```

```
##  [704] │ Hedwi<g>
## [1145] │ Irvin<g>
## [1273] │ Kin<g>
## [1382] │ Sterlin<g>
## [1446] │ Youn<g>
## [1580] │ Won<g>
## [1608] │ Ludwi<g>
## [1902] │ Flemin<g>
## [1985] │ Starlin<g>
```

- `\b` matches the boundary between words, i.e. the start or end of a word:

```r
str_view(fruit, "\\bapple\\b")
```

```
## [1] │ <apple>
```

```r
str_view(fruit, "apple")
```

```
##  [1] │ <apple>
## [62] │ pine<apple>
```

---
## Your Turn

Try Exercise 7

---
## Character classes

You can construct your own **character class** or **set** with `[]`.

- `[abc]` matches "a", "b", or "c" and `[^abc]` matches any character except "a", "b", or "c".
- `-` defines a range, e.g., `[a-z]` matches any lower case letter and `[0-9]` matches any digit.
- `\` escapes special characters, so `[\^\-\.]` matches `^`, `-`, or `.`.

```r
x <- "abcd ABCD 12345 -!@#%."

str_view(x, "[abc]+")
```

```
## [1] │ <abc>d ABCD 12345 -!@#%.
```

```r
str_view(x, "[a-z0-9]+")
```

```
## [1] │ <abcd> ABCD <12345> -!@#%.
```

```r
str_view(x, "[^a-z0-9]+")
```

```
## [1] │ abcd< ABCD >12345< -!@#%.>
```
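
Escaping inside a character class can be sketched the same way. As with any regular expression written as an R string, each backslash is doubled. `str_extract_all()` is used here only to list every match.

```r
library(stringr)

x <- "abcd ABCD 12345 -!@#%."

# The regex [\^\-\.] written as an R string: matches a literal ^, -, or .
str_extract_all(x, "[\\^\\-\\.]")
```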

---
## Character shortcuts

`.` matches any character apart from a newline. There are three other useful pairs:

- `\d` matches any digit; `\D` matches anything that **isn't** a digit
- `\s` matches any whitespace (e.g. space, tab, newline); `\S` matches anything that isn't whitespace.
- `\w` matches any "word" character, i.e. letters, digits, and the underscore; `\W` matches any "non-word" character.

```r
x <- "abcd ABCD 12345 -!@#%."

str_view(x, "\\d+")
```

```
## [1] │ abcd ABCD <12345> -!@#%.
```

```r
str_view(x, "\\D+")
```

```
## [1] │ <abcd ABCD >12345< -!@#%.>
```

```r
str_view(x, "\\s+")
```

```
## [1] │ abcd< >ABCD< >12345< >-!@#%.
```

```r
str_view(x, "\\S+")
```

```
## [1] │ <abcd> <ABCD> <12345> <-!@#%.>
```

```r
str_view(x, "\\w+")
```

```
## [1] │ <abcd> <ABCD> <12345> -!@#%.
```

```r
str_view(x, "\\W+")
```

```
## [1] │ abcd< >ABCD< >12345< -!@#%.>
```

---
## Quantifiers

- **Quantifiers** control how many times a pattern matches. 
    - `?` (0 or 1 matches)
    - `+` (1 or more matches)
    - `*` (0 or more matches)

```r
spelling_bee <- c("color", "colour", "coloor")

str_view(spelling_bee, pattern = "colou?r")
```

```
## [1] │ <color>
## [2] │ <colour>
```

```r
lab_fees <- c("100 dollars", "10$", "1500 USD")
# \\d+ for one or more digits
str_extract(lab_fees, pattern = "\\d+")
```

```
## [1] "100"  "10"   "1500"
```

---
## Number of matches

You can specify the number of matches precisely with `{}`:

- `{n}` matches exactly n times
- `{n,}` matches **at least** n times
- `{n,m}` matches between n and m times
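
A minimal sketch of the three forms, using a made-up vector of repeated "a"s. Note that quantifiers are greedy: they take as many repetitions as the bounds allow.

```r
library(stringr)

# Hypothetical strings with one to four "a"s
x <- c("a", "aa", "aaa", "aaaa")

str_extract(x, "a{2}")    # extracts exactly two "a"s
str_extract(x, "a{2,}")   # extracts two or more "a"s
str_extract(x, "a{2,3}")  # extracts between two and three "a"s
```

The first element is `NA` in every case because a single "a" does not meet the minimum of two.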

---
## Operator precedence and parentheses

- Does `ab+` match "a" followed by one or more "b"s, or does it match "ab" repeated any number of times?

- What does `^a|b$` match? Does it match the complete string "a" **or** the complete string "b", or does it match a string starting with "a" or a string ending with "b"?

---
## Operator precedence and parentheses

- It's similar to PEMDAS or BEDMAS in arithmetic.
- Quantifiers have high precedence and alternation has low precedence, which means `ab+` is equivalent to `a(b+)`, and `^a|b$` is equivalent to `(^a)|(b$)`.
- You can use parentheses to override the usual order. To match anything in the middle, try `.+`.

```r
word_ex <- c("abba", "ababab", "arbitrary", "about", "aplomb")

str_view(word_ex, pattern = "ab+")
```

```
## [1] │ <abb>a
## [2] │ <ab><ab><ab>
## [4] │ <ab>out
```

```r
str_view(word_ex, pattern = "(ab)+")
```

```
## [1] │ <ab>ba
## [2] │ <ababab>
## [4] │ <ab>out
```

```r
str_view(word_ex, pattern = "^a|b$")
```

```
## [1] │ <a>bba
## [2] │ <a>baba<b>
## [3] │ <a>rbitrary
## [4] │ <a>bout
## [5] │ <a>plom<b>
```

```r
str_view(word_ex, pattern = "^a.+b$")
```

```
## [2] │ <ababab>
## [5] │ <aplomb>
```

---
## Recap of common expressions

- `"a"` = contains the letter "a"
- `"^a"` = starts with the letter "a"
- `"a$"` = ends with the letter "a"
- `"[ ]"` = contains any letter (or number) within the brackets
- `"[ - ]"` = contains any letter (or number) within this range
- `"[^ae]"` = everything except these letters (or numbers)
- `"{3}"` = repeats the preceding expression exactly three times

For more expressions or examples, refer to <http://www.regular-expressions.info/refquick.html>

---
# Thank you