class: center, middle, inverse, title-slide

.title[
# Working with regular expressions
]
.subtitle[
##
College of the Atlantic
]
.author[
### Laurie Baker
]

---

class: middle

# Regular Expressions

---

## Regular Expressions

Pattern matching functions use patterns, otherwise known as "regular expressions" or "regex", to identify specific characteristics in strings.

- Reference material: [R4DS Regular Expressions](https://r4ds.hadley.nz/regexps)

---

## Pattern basics

- We can use `str_view()` to show the elements of a string vector that match a regular expression. Each match is wrapped in `<>`.

```r
library(stringr)
library(babynames)
library(dplyr) # assumed: needed for filter(), count(), mutate(), and %>% on later slides
```

The simplest patterns consist of letters and numbers, which match those exact characters:

```r
str_view(string = fruit, pattern = "berry")
```

```
## [6] │ bil<berry>
## [7] │ black<berry>
## [10] │ blue<berry>
## [11] │ boysen<berry>
## [19] │ cloud<berry>
## [21] │ cran<berry>
## [29] │ elder<berry>
## [32] │ goji <berry>
## [33] │ goose<berry>
## [38] │ huckle<berry>
## [50] │ mul<berry>
## [70] │ rasp<berry>
## [73] │ salal <berry>
## [76] │ straw<berry>
```

*Note that the first argument of `str_view()` is the `string` and the second argument is the `pattern`.

---

## Literal characters and meta characters

- **Literal characters**: letters and numbers.
- **Meta characters**: most punctuation characters, like `.`, `+`, `*`, `[`, `]`, and `?`, which have special meanings.
- For example, `.` will match any character. So `"a."` will match any string that contains an `"a"` followed by another character:

```r
str_view(c("apple", "orange", "pomegranate"), pattern = "a.")
```

```
## [1] │ <ap>ple
## [2] │ or<an>ge
## [3] │ pomegr<an><at>e
```

We could find names that include an `"a"` followed by any three characters, followed by an `"e"`.
```r
baby1880 <- babynames %>%
  filter(year == 1880)

str_view(baby1880$name, pattern = "a...e")
```

```
## [20] │ C<arrie>
## [27] │ H<attie>
## [29] │ M<attie>
## [46] │ M<aggie>
## [49] │ F<annie>
## [59] │ Bl<anche>
## [64] │ S<allie>
## [73] │ H<arrie>t
## [87] │ N<annie>
## [124] │ C<allie>
## [156] │ H<arrie>tt
## [172] │ Kath<arine>
## [214] │ H<allie>
## [224] │ M<argue>rite
## [271] │ C<assie>
## [279] │ M<argie>
## [296] │ Cath<arine>
## [332] │ M<argre>t
## [333] │ Ad<aline>
## [350] │ P<attie>
## ... and 47 more
```

---

## Quantifiers

**Quantifiers** control how many times a pattern can match.

- `?` makes a pattern optional (i.e. it matches 0 or 1 times)
- `+` lets a pattern repeat (i.e. it matches at least once)
- `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0)

```r
str_view(c("a", "ab", "abb"), pattern = "ab?")
```

```
## [1] │ <a>
## [2] │ <ab>
## [3] │ <ab>b
```

```r
str_view(c("a", "ab", "abb"), pattern = "ab+")
```

```
## [2] │ <ab>
## [3] │ <abb>
```

```r
str_view(c("a", "ab", "abb"), pattern = "ab*")
```

```
## [1] │ <a>
## [2] │ <ab>
## [3] │ <abb>
```

- **Your Turn**: With your partner, write a short description of what each pattern matches.

---

## Character Classes

**Character classes** are defined by `[]` and let you match a set of characters.

- e.g. `[lmno]` matches "l", "m", "n", or "o".
- Starting the class with `^` inverts the match, so `[^lmno]` matches anything **except** "l", "m", "n", or "o".

```r
str_view(words, "[aeiou]x[aeiou]")
```

```
## [284] │ <exa>ct
## [285] │ <exa>mple
## [288] │ <exe>rcise
## [289] │ <exi>st
```

```r
str_view(words, "[^aeiou]y[^aeiou]")
```

```
## [836] │ <sys>tem
## [901] │ <typ>e
```

---

## Your Turn

Work on Exercises 1 and 2.

---

## Alternation

**Alternation**, `|`, picks between two or more alternative patterns.

- Fruits with "nut", "berry", or "apple".
```r
str_view(fruit, "apple|nut|berry")
```

```
## [1] │ <apple>
## [6] │ bil<berry>
## [7] │ black<berry>
## [10] │ blue<berry>
## [11] │ boysen<berry>
## [19] │ cloud<berry>
## [20] │ coco<nut>
## [21] │ cran<berry>
## [29] │ elder<berry>
## [32] │ goji <berry>
## [33] │ goose<berry>
## [38] │ huckle<berry>
## [50] │ mul<berry>
## [52] │ <nut>
## [62] │ pine<apple>
## [70] │ rasp<berry>
## [73] │ salal <berry>
## [76] │ straw<berry>
```

- Names with a repeating vowel

```r
str_view(baby1880$name, pattern = "aa|ee|ii|oo|uu")
```

```
## [300] │ L<ee>
## [358] │ Kathl<ee>n
## [416] │ Qu<ee>n
## [466] │ Aim<ee>
## [539] │ Qu<ee>nie
## [725] │ Paral<ee>
## [739] │ Ail<ee>n
## [768] │ D<ee>
## [923] │ Rosal<ee>
## [932] │ Tenness<ee>
## [985] │ L<ee>
## [1004] │ Is<aa>c
## [1213] │ Gr<ee>n
## [1279] │ Fr<ee>man
## [1298] │ Elw<oo>d
## [1326] │ D<ee>
## [1466] │ Wh<ee>ler
## [1559] │ Hayw<oo>d
## [1685] │ W<oo>dson
## [1705] │ Cr<ee>d
## ... and 10 more
```

---

## Your Turn

Work on Exercises 3 and 4.

---

## Key stringr functions

- `str_detect()`: detect the presence or absence of a match
- `str_count()`: count how many matches are in each string
- `str_extract()`: extract the matches

---

## Detect matches

`str_detect()` returns a logical vector that is `TRUE` if the pattern matches an element of the character vector and `FALSE` otherwise.

```r
str_detect(c("a", "b", "c"), "[aeiou]")
```

```
## [1]  TRUE FALSE FALSE
```

`str_detect()` works well with `filter()`

```r
baby1880 |>
  filter(str_detect(name, pattern = "x")) |>
  count(name, wt = n, sort = TRUE)
```

```
## # A tibble: 13 × 2
##   name          n
##   <chr>     <int>
## 1 Alexander   211
## 2 Alex        147
## 3 Felix        92
## 4 Roxie        62
## 5 Max          52
## 6 Axel         16
## # ℹ 7 more rows
```

---

## Count matches

`str_count()` can be used to count the number of times a pattern occurs in a string.
```r
str_count("ababbaba", pattern = "ab")
```

```
## [1] 3
```

```r
str_view("ababbaba", pattern = "ab")
```

```
## [1] │ <ab><ab>b<ab>a
```

`str_count()` can be easily paired with `mutate()`

```r
# number of vowels and consonants for babynames with an x
babynames |>
  filter(str_detect(name, pattern = "x")) |>
  count(name) |>
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  ) |>
  arrange(vowels, consonants, n)
```

```
## # A tibble: 974 × 4
##   name      n vowels consonants
##   <chr> <int>  <int>      <int>
## 1 Nyx       9      0          3
## 2 Axl      30      0          3
## 3 Axyl      9      0          4
## 4 Lynx      9      0          4
## 5 Eryx     11      0          4
## 6 Alyx     57      0          4
## # ℹ 968 more rows
```

There's something off with our calculations. What is it?

---

## Regular expressions are case sensitive

Options:

- Add uppercase vowels: `str_count(name, pattern = "[aeiouAEIOU]")`
- Tell the regular expression to ignore case: `str_count(name, regex("[aeiou]", ignore_case = TRUE))`
- Use `str_to_lower()` to convert the names to lower case: `str_count(str_to_lower(name), "[aeiou]")`

---

## Replace values

- We can also modify matches with `str_replace()` and `str_replace_all()`. `str_replace()` replaces the first match and `str_replace_all()` replaces all matches.
```r
babynames |>
  count(name) |>
  filter(str_detect(name, pattern = "[cC]")) |>
  mutate(new_name = str_replace_all(name, "[cC]", "cup"))
```

```
## # A tibble: 12,917 × 3
##   name        n new_name
##   <chr>   <int> <chr>
## 1 Aalicia     1 Aalicupia
## 2 Aalycia     4 Aalycupia
## 3 Aanchal    10 Aancuphal
## 4 Aaric      33 Aaricup
## 5 Aarica     13 Aaricupa
## 6 Aarick      5 Aaricupk
## # ℹ 12,911 more rows
```

---

## Replace values

```r
str_replace(fruit, pattern = "a", replacement = "e") # only the first instance
```

```
## [1] "epple" "epricot" "evocado"
## [4] "benana" "bell pepper" "bilberry"
## [7] "bleckberry" "bleckcurrant" "blood orenge"
## [10] "blueberry" "boysenberry" "breedfruit"
## [13] "cenary melon" "centaloupe" "cherimoye"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberry" "coconut" "crenberry"
## [22] "cucumber" "current" "demson"
## [25] "dete" "dregonfruit" "durien"
## [28] "eggplent" "elderberry" "feijoe"
## [31] "fig" "goji berry" "gooseberry"
## [34] "grepe" "grepefruit" "gueva"
## [37] "honeydew" "huckleberry" "jeckfruit"
## [40] "jembul" "jujube" "kiwi fruit"
## [43] "kumquet" "lemon" "lime"
## [46] "loquet" "lychee" "mendarine"
## [49] "mengo" "mulberry" "necterine"
## [52] "nut" "olive" "orenge"
## [55] "pemelo" "pepaya" "pessionfruit"
## [58] "peech" "peer" "persimmon"
## [61] "physelis" "pineepple" "plum"
## [64] "pomegrenate" "pomelo" "purple mengosteen"
## [67] "quince" "reisin" "rembutan"
## [70] "respberry" "redcurrent" "rock melon"
## [73] "selal berry" "setsuma" "ster fruit"
## [76] "strewberry" "temarillo" "tengerine"
## [79] "ugli fruit" "wetermelon"
```

```r
str_replace_all(fruit, pattern = "a", replacement = "e") # every instance
```

```
## [1] "epple" "epricot" "evocedo"
## [4] "benene" "bell pepper" "bilberry"
## [7] "bleckberry" "bleckcurrent" "blood orenge"
## [10] "blueberry" "boysenberry" "breedfruit"
## [13] "cenery melon" "centeloupe" "cherimoye"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberry" "coconut" "crenberry"
## [22] "cucumber" "current" "demson"
## [25] "dete" "dregonfruit" "durien"
## [28] "eggplent" "elderberry" "feijoe"
## [31] "fig" "goji berry" "gooseberry"
## [34] "grepe" "grepefruit" "gueve"
## [37] "honeydew" "huckleberry" "jeckfruit"
## [40] "jembul" "jujube" "kiwi fruit"
## [43] "kumquet" "lemon" "lime"
## [46] "loquet" "lychee" "menderine"
## [49] "mengo" "mulberry" "necterine"
## [52] "nut" "olive" "orenge"
## [55] "pemelo" "pepeye" "pessionfruit"
## [58] "peech" "peer" "persimmon"
## [61] "physelis" "pineepple" "plum"
## [64] "pomegrenete" "pomelo" "purple mengosteen"
## [67] "quince" "reisin" "rembuten"
## [70] "respberry" "redcurrent" "rock melon"
## [73] "selel berry" "setsume" "ster fruit"
## [76] "strewberry" "temerillo" "tengerine"
## [79] "ugli fruit" "wetermelon"
```

---

## Your Turn

Work on Exercises 5 and 6.

---

## Extracting Patterns

- You can extract pattern matches using `str_extract()`

```r
lab_fees <- c("100 dollars", "10$", "1500 USD")

# Works here because every string starts with digits;
# note that `*` can also match zero digits (an empty string).
str_extract(lab_fees, pattern = "[0-9]*")
```

```
## [1] "100"  "10"   "1500"
```

---

## Extract phone numbers case study

#### **Maine phone numbers:**

- Maine phone numbers begin with 207 and are followed by 7 more digits.
- Regular expression: `"^207-[0-9]{3}-[0-9]{4}"`

```r
phone_numbers <- c("207-846-3630", "207-865-3823", "603-712-5043",
                   "0783-792-9863", "+4478438971066", "207-865-7916")

str_extract(phone_numbers, pattern = "^207-[0-9]{3}-[0-9]{4}")
```

```
## [1] "207-846-3630" "207-865-3823" NA NA
## [5] NA "207-865-7916"
```

*N.B. `{3}` repeats the preceding pattern exactly three times.

---

## Pattern details

- **Escaping**: how to match metacharacters.
- **Anchors**: match the start or end of a string.
- More on **character classes** and their shortcuts.
- More on **quantifiers**, which control how many times a pattern can match.

---

## Escaping

- Regular expressions use the backslash for escaping metacharacters.
- To match a literal `.`, we need the regexp `\.`.
- Because we use strings to represent the regular expression, and `\` is also used as an escape symbol in strings, to create the regular expression `\.` we need the string `"\\."`.

```r
str_view(c("abc", "a.c", "bef"), "a\\.c")
```

```
## [2] │ <a.c>
```

---

## Anchors

By default, regular expressions will match any part of a string. To match the start or end you need to **anchor** the regular expression:

- `^` matches the start.
- `$` matches the end.

```r
str_view(baby1880$name, "^G")
```

```
## [19] │ <G>race
## [25] │ <G>ertrude
## [84] │ <G>eorgia
## [202] │ <G>enevieve
## [209] │ <G>ussie
## [237] │ <G>eorgie
## [256] │ <G>eorgiana
## [266] │ <G>ertie
## [298] │ <G>eneva
## [312] │ <G>eorge
## [330] │ <G>oldie
## [370] │ <G>ladys
## [393] │ <G>eorgianna
## [394] │ <G>racie
## [529] │ <G>ena
## [700] │ <G>eraldine
## [701] │ <G>ina
## [702] │ <G>lenna
## [703] │ <G>rayce
## [784] │ <G>olda
## ... and 45 more
```

```r
str_view(baby1880$name, "g$")
```

```
## [704] │ Hedwi<g>
## [1145] │ Irvin<g>
## [1273] │ Kin<g>
## [1382] │ Sterlin<g>
## [1446] │ Youn<g>
## [1580] │ Won<g>
## [1608] │ Ludwi<g>
## [1902] │ Flemin<g>
## [1985] │ Starlin<g>
```

- `\b` matches the boundary between words

```r
str_view(fruit, "\\bapple\\b")
```

```
## [1] │ <apple>
```

```r
str_view(fruit, "apple")
```

```
## [1] │ <apple>
## [62] │ pine<apple>
```

---

## Your Turn

Try Exercise 7.

---

## Character classes

You can construct your own **character class** or **set** with `[]`.

- `[abc]` matches "a", "b", or "c" and `[^abc]` matches any character except "a", "b", or "c".
- `-` defines a range, e.g., `[a-z]` matches any lower case letter and `[0-9]` matches any digit.
- `\` escapes special characters, so `[\^\-\.]` matches `^`, `-`, or `.`

```r
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "[abc]+")
```

```
## [1] │ <abc>d ABCD 12345 -!@#%.
```

```r
str_view(x, "[a-z0-9]+")
```

```
## [1] │ <abcd> ABCD <12345> -!@#%.
```

```r
str_view(x, "[^a-z0-9]+")
```

```
## [1] │ abcd< ABCD >12345< -!@#%.>
```

---

## Character shortcuts

`.` matches any character apart from a newline. There are three other useful pairs:

- `\d` matches any digit; `\D` matches anything that **isn't** a digit.
- `\s` matches any whitespace (e.g. space, tab, newline); `\S` matches anything that **isn't** whitespace.
- `\w` matches any "word" character, i.e. letters and numbers (plus the underscore); `\W` matches any "non-word" character.

```r
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
```

```
## [1] │ abcd ABCD <12345> -!@#%.
```

```r
str_view(x, "\\D+")
```

```
## [1] │ <abcd ABCD >12345< -!@#%.>
```

```r
str_view(x, "\\s+")
```

```
## [1] │ abcd< >ABCD< >12345< >-!@#%.
```

```r
str_view(x, "\\S+")
```

```
## [1] │ <abcd> <ABCD> <12345> <-!@#%.>
```

```r
str_view(x, "\\w+")
```

```
## [1] │ <abcd> <ABCD> <12345> -!@#%.
```

```r
str_view(x, "\\W+")
```

```
## [1] │ abcd< >ABCD< >12345< -!@#%.>
```

---

## Quantifiers

- **Quantifiers** control how many times a pattern matches.
- `?` (0 or 1 matches)
- `+` (1 or more matches)
- `*` (0 or more matches)

```r
spelling_bee <- c("color", "colour", "coloor")
str_view(spelling_bee, pattern = "colou?r")
```

```
## [1] │ <color>
## [2] │ <colour>
```

```r
lab_fees <- c("100 dollars", "10$", "1500 USD")
# \\d+ for one or more digits
str_extract(lab_fees, pattern = "\\d+")
```

```
## [1] "100"  "10"   "1500"
```

---

## Number of matches

You can specify the number of matches precisely with `{}`:

- `{n}` matches exactly n times
- `{n,}` matches **at least** n times
- `{n,m}` matches between n and m times

---

## Operator precedence and parentheses

- Does `ab+` match "a" followed by one or more "b"s, or does it match "ab" repeated any number of times?

--

- What does `^a|b$` match? Does it match the complete string "a" **or** the complete string "b", or does it match a string starting with "a" **or** a string ending with "b"?

---

## Operator precedence and parentheses

- Operator precedence in regular expressions is similar to PEMDAS or BEDMAS in arithmetic.
- Quantifiers have high precedence and alternation has low precedence, which means `ab+` is equivalent to `a(b+)`, and `^a|b$` is equivalent to `(^a)|(b$)`.
- You can use parentheses to override the usual order. To match anything in the middle, try `.+`.

```r
word_ex <- c("abba", "ababab", "arbitrary", "about", "aplomb")
str_view(word_ex, pattern = "ab+")
```

```
## [1] │ <abb>a
## [2] │ <ab><ab><ab>
## [4] │ <ab>out
```

```r
str_view(word_ex, pattern = "(ab)+")
```

```
## [1] │ <ab>ba
## [2] │ <ababab>
## [4] │ <ab>out
```

```r
str_view(word_ex, pattern = "^a|b$")
```

```
## [1] │ <a>bba
## [2] │ <a>baba<b>
## [3] │ <a>rbitrary
## [4] │ <a>bout
## [5] │ <a>plom<b>
```

```r
str_view(word_ex, pattern = "^a.+b$")
```

```
## [2] │ <ababab>
## [5] │ <aplomb>
```

---

## Recap of common expressions:

- "a" = the letter "a"
- "^a" = starts with the letter "a"
- "a$" = ends with the letter "a"
- "[ ]" = any one letter (or number) within the brackets
- "[ - ]" = any one letter (or number) within this range
- "[^ae]" = any character except these letters (or numbers)
- "{3}" = repeats the preceding expression exactly three times

For more expressions or examples, refer to <http://www.regular-expressions.info/refquick.html>

---

# Thank you
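---

## Appendix: number of matches in practice

The "Number of matches" slide lists `{n}`, `{n,}`, and `{n,m}` without an example. Here is a minimal sketch of how they might behave. The vector `x` and the result names are made up for illustration, and the `^` and `$` anchors are added so that the whole string has to match.

```r
library(stringr)

# Hypothetical example strings (not from the slides above)
x <- c("a", "aa", "aaa", "aaaa")

# {2}: exactly two "a"s
exactly_two <- str_detect(x, "^a{2}$")
exactly_two
# FALSE TRUE FALSE FALSE

# {2,}: at least two "a"s
at_least_two <- str_detect(x, "^a{2,}$")
at_least_two
# FALSE TRUE TRUE TRUE

# {2,3}: between two and three "a"s
two_to_three <- str_detect(x, "^a{2,3}$")
two_to_three
# FALSE TRUE TRUE FALSE

# Without anchors, {2,3} takes as many "a"s as it can, up to three
greedy <- str_extract(x, "a{2,3}")
greedy
# NA "aa" "aaa" "aaa"
```

Without the anchors, `str_detect(x, "a{2}")` would also be `TRUE` for "aaa" and "aaaa", because the pattern only needs to match somewhere inside the string.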