class: center, middle, inverse, title-slide

.title[
# Working with strings
]
.subtitle[
## College of the Atlantic
]
.author[
### Laurie Baker
]

---
class: middle

# Working with strings using stringr and regex

---

## Introduction

> Strings play a big role in many data cleaning and preparation tasks.

- Character strings in R are wrapped in double (`"`) or single (`'`) quotes.
- Character strings can contain letters "a", numbers "1", symbols "&", or a combination "a1&".
- Note: the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"` unless the string already contains double quotes and no single quotes.

---

## stringr

- `stringr` is a lightweight package designed by Hadley Wickham to assist with string manipulation.
- `stringr` functions begin with `str_`. Type `str_` and press TAB to see the options.

<div class="figure" style="text-align: center">
<img src="img/stringr-autocomplete.png" alt="Stringr autocomplete functions from R4DS" width="60%" />
<p class="caption">Stringr autocomplete functions from R4DS</p>
</div>

---

## Overview

- Getting Started with stringr
- Basic String Operators
- Special Characters

---

## Getting Started

#### Loading stringr

```r
install.packages("stringr")  # only needed once
library(stringr)
```

```r
movie_titles <- c("gold diggers of broadway", "gone baby gone",
                  "gone in 60 seconds", "gone with the wind",
                  "good girl, the", "good burger", "goodbye girl, the",
                  "good bye lenin!", "goodfellas", "good luck chuck",
                  "good morning, vietnam", "good night, and good luck.",
                  "good son, the", "good will hunting")

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
             "387 287 6718", "apple", "233.398.9187 ", "482 952 3315",
             "239 923 8115 and 842 566 4692", "Work: 579-499-7527",
             "$1000", "Home: 543.355.3679")
```

---

## Basic String Operators

- String operators are basic string manipulation functions.
- Many have base R equivalents, but the stringr versions use consistent names and argument order.

---

## `str_to_upper(string)`

- Converts strings to uppercase
- Example: convert all `movie_titles` to uppercase and store them as `movie_titles`

```r
movie_titles <- str_to_upper(movie_titles)
movie_titles
```

```
##  [1] "GOLD DIGGERS OF BROADWAY"   "GONE BABY GONE"
##  [3] "GONE IN 60 SECONDS"         "GONE WITH THE WIND"
##  [5] "GOOD GIRL, THE"             "GOOD BURGER"
##  [7] "GOODBYE GIRL, THE"          "GOOD BYE LENIN!"
##  [9] "GOODFELLAS"                 "GOOD LUCK CHUCK"
## [11] "GOOD MORNING, VIETNAM"      "GOOD NIGHT, AND GOOD LUCK."
## [13] "GOOD SON, THE"              "GOOD WILL HUNTING"
```

---

## `str_to_lower(string)`

- Converts strings to lowercase
- Example: convert all `movie_titles` back to lowercase and save them as `movie_titles`

```r
movie_titles <- str_to_lower(movie_titles)
movie_titles
```

```
##  [1] "gold diggers of broadway"   "gone baby gone"
##  [3] "gone in 60 seconds"         "gone with the wind"
##  [5] "good girl, the"             "good burger"
##  [7] "goodbye girl, the"          "good bye lenin!"
##  [9] "goodfellas"                 "good luck chuck"
## [11] "good morning, vietnam"      "good night, and good luck."
## [13] "good son, the"              "good will hunting"
```

---

## `str_to_title(string)`

- Converts strings to title case
- Example: convert all `movie_titles` to title case and store them as `movie_titles`

```r
movie_titles <- str_to_title(movie_titles)
movie_titles
```

```
##  [1] "Gold Diggers Of Broadway"   "Gone Baby Gone"
##  [3] "Gone In 60 Seconds"         "Gone With The Wind"
##  [5] "Good Girl, The"             "Good Burger"
##  [7] "Goodbye Girl, The"          "Good Bye Lenin!"
##  [9] "Goodfellas"                 "Good Luck Chuck"
## [11] "Good Morning, Vietnam"      "Good Night, And Good Luck."
## [13] "Good Son, The"              "Good Will Hunting"
```

---

## `str_length(string)`

- Returns the number of characters in each string
- `str_length()` converts factors to strings and preserves `NA`s

```r
str_length("hello")
```

```
## [1] 5
```

---

## `str_c(string, sep = "")`

- Joins multiple strings into one; non-character inputs such as integers are coerced to character
- Is the stringr equivalent of `paste(sep = "")` or `paste0()`

## `str_dup(string, times)`

- Duplicates a string a given number of times.
- Essentially a copy-and-paste function

```r
str_c("Heartbreakers gonna ", str_dup("break, ", 3), "break")
```

```
## [1] "Heartbreakers gonna break, break, break, break"
```

---

## `str_sub(string, start, end)`

- Subsets text within a string or vector of strings by specifying start and end positions.
- The base R equivalent is `substr()`

```r
str_sub("Dr Jekyll", start = 1, end = 2)
```

```
## [1] "Dr"
```

---

## `str_trim(string)`

- Removes leading and trailing whitespace
- `side = c("both", "left", "right")`; the `side` argument defaults to `"both"`
- Example: trim the whitespace from both sides of every string in `strings`

```r
str_trim(strings, side = "both")
```

```
##  [1] "219 733 8965"
##  [2] "329-293-8753"
##  [3] "banana"
##  [4] "595 794 7569"
##  [5] "387 287 6718"
##  [6] "apple"
##  [7] "233.398.9187"
##  [8] "482 952 3315"
##  [9] "239 923 8115 and 842 566 4692"
## [10] "Work: 579-499-7527"
## [11] "$1000"
## [12] "Home: 543.355.3679"
```

---

## `str_pad()`

- Pads strings with whitespace up to a minimum width
- `width` sets the minimum width of the padded string; strings that are already that long are left unchanged
- `side` defaults to `"left"`
- Example: pad `movie_titles` with whitespace on the right so that each title is 30 characters long.

```r
str_pad(movie_titles, side = "right", width = 30)
```

```
##  [1] "Gold Diggers Of Broadway      "
##  [2] "Gone Baby Gone                "
##  [3] "Gone In 60 Seconds            "
##  [4] "Gone With The Wind            "
##  [5] "Good Girl, The                "
##  [6] "Good Burger                   "
##  [7] "Goodbye Girl, The             "
##  [8] "Good Bye Lenin!               "
##  [9] "Goodfellas                    "
## [10] "Good Luck Chuck               "
## [11] "Good Morning, Vietnam         "
## [12] "Good Night, And Good Luck.    "
## [13] "Good Son, The                 "
## [14] "Good Will Hunting             "
```

---

## Making sentences with `str_glue`

```r
library(dplyr)  # provides tibble(), mutate(), group_by(), summarize()

ship_log <- tibble(
  locations = c("Bar Harbor, Mt. Desert Rock, Great Cranberry Island",
                "Winter Harbor, Stave Island",
                "Bass Rock"),
  day = c("Monday", "Tuesday", "Monday")
)

ship_log |>
  mutate(account = str_glue("On {day} we went to {locations}"))
```

```
## # A tibble: 3 × 3
##   locations                                         day   account
##   <chr>                                             <chr> <glue>
## 1 Bar Harbor, Mt. Desert Rock, Great Cranberry Isl… Mond… On Mon…
## 2 Winter Harbor, Stave Island                       Tues… On Tue…
## 3 Bass Rock                                         Mond… On Mon…
```

---

# Summarize strings with `str_flatten`

- `str_flatten()` takes a character vector and combines its elements into a single string.

```r
# `fruit` is an example character vector that ships with stringr;
# the four names are recycled along its length
purchases <- data.frame(fruit, name = c("Carmen", "Carmen", "Aziz", "Maganga"))

purchases |>
  group_by(name) |>
  summarize(fruits = str_flatten(fruit, ", ", last = " and "))
```

```
## # A tibble: 3 × 2
##   name    fruits
##   <chr>   <chr>
## 1 Aziz    avocado, blackberry, boysenberry, cherimoya, cloudberr…
## 2 Carmen  apple, apricot, bell pepper, bilberry, blood orange, b…
## 3 Maganga banana, blackcurrant, breadfruit, cherry, coconut, dam…
```

---

# Extracting data from strings

- Separating into rows: `separate_longer_delim()`
- Separating into columns: `separate_wider_delim()`

These tidyr functions are similar to `pivot_longer()` and `pivot_wider()`. They are helpful when text contains information that should be in its own column or row.

---

# Separate into rows

```r
library(tidyr)  # provides the separate_*() functions

ship_log <- tibble(
  locations = c("Bar Harbor, Mt. Desert Rock, Great Cranberry Island",
                "Winter Harbor, Stave Island",
                "Bass Rock"),
  day = c("Monday", "Tuesday", "Thursday")
)

ship_log |>
  separate_longer_delim(locations, delim = ",")
```

```
## # A tibble: 6 × 2
##   locations                 day
##   <chr>                     <chr>
## 1 "Bar Harbor"              Monday
## 2 " Mt. Desert Rock"        Monday
## 3 " Great Cranberry Island" Monday
## 4 "Winter Harbor"           Tuesday
## 5 " Stave Island"           Tuesday
## 6 "Bass Rock"               Thursday
```

---

# Separate into columns by delimiter

```r
df <- tibble(
  quote = c("LB: What did you have for breakfast today?",
            "GK: Toast and a cup of coffee.",
            "LB: When did you first become involved in fishing?",
            "GK: When I was a kid, I used to go out with my uncle Bobby")
)

df |>
  separate_wider_delim(quote, delim = ":", names = c("speaker", "quote"))
```

```
## # A tibble: 4 × 2
##   speaker quote
##   <chr>   <chr>
## 1 LB      " What did you have for breakfast today?"
## 2 GK      " Toast and a cup of coffee."
## 3 LB      " When did you first become involved in fishing?"
## 4 GK      " When I was a kid, I used to go out with my uncle Bob…
```

---

# Separate into columns by position

```r
df2 <- tibble(term = c("S19", "F19", "W20", "S20", "F20"))

df2 |>
  separate_wider_position(term, widths = c(term = 1, year = 2))
```

```
## # A tibble: 5 × 2
##   term  year
##   <chr> <chr>
## 1 S     19
## 2 F     19
## 3 W     20
## 4 S     20
## 5 F     20
```

---

### Combining `stringr` functions

```r
df2 <- tibble(term = c("S19", "F19", "W20", "S20", "F20"))

df2 |>
  separate_wider_position(term, widths = c(term = 1, year = 2)) |>
  mutate(year = str_c("20", year))
```

```
## # A tibble: 5 × 2
##   term  year
##   <chr> <chr>
## 1 S     2019
## 2 F     2019
## 3 W     2020
## 4 S     2020
## 5 F     2020
```

---

# Special characters in strings

- Some characters, such as quotes, backslashes, and non-printable characters like newlines, must be **escaped** with a backslash (`\`) to be included in a string.
- Printing a string shows the escapes; to see the raw contents of the string, use `str_view()`.

```r
double_quote <- "\""
single_quote <- "\'"
backslash <- "\\"
newline <- "first line \n second line"

x <- c(double_quote, single_quote, backslash, newline)
x
```

```
## [1] "\""                        "'"
## [3] "\\"                        "first line \n second line"
```

```r
str_view(x)
```

```
## [1] │ "
## [2] │ '
## [3] │ \
## [4] │ first line
##     │  second line
```

---

# Thank you
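---

# Appendix: trimming after a split

In the "Separate into rows" slide, splitting on `","` leaves a leading space on every piece after the first. The sketch below shows one possible way to clean that up using only functions already introduced in these slides (`str_trim()`, `str_flatten()`) plus `str_split()`, which is not covered above. The variable names `pieces`, `stops`, and `summary_line` are illustrative assumptions, not part of the original example.

```r
library(stringr)

# Illustrative input: one comma-separated entry like those in the ship log
locations <- "Bar Harbor, Mt. Desert Rock, Great Cranberry Island"

# Split on the comma; every piece after the first keeps a leading space
pieces <- str_split(locations, ",")[[1]]

# Remove the leading and trailing whitespace from each piece
stops <- str_trim(pieces)

# Rebuild a readable sentence fragment from the cleaned pieces
summary_line <- str_flatten(stops, ", ", last = " and ")

stops
summary_line
```

The same `str_trim()` call could be applied inside `mutate()` after `separate_longer_delim()` to clean the `locations` column of a tibble.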