class: center, middle, inverse, title-slide

# Web scraping
## College of the Atlantic

---

class: middle

# Scraping the web

---

## Scraping the web: what? why?

- An increasing amount of data is available on the web

--

- These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors

--

- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

--

- Two different scenarios:
    - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

---

class: middle

# Web Scraping with rvest

---

## Hypertext Markup Language

- Most data on the web is still available as HTML
- It is structured (hierarchical / tree based), but it's often not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straightforward
- It's designed to work with pipelines built with `%>%`
]

.pull-right[
<img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---

## Core rvest functions

- `read_html` - Read HTML data from a URL or character string
- `html_element` - Find the first matching HTML element using CSS selectors or XPath expressions
- `html_elements` - Find all matching HTML elements using CSS selectors or XPath expressions
- `html_table` - Parse an HTML table into a data frame
- `html_text` - Extract tag pairs' content
- `html_name` - Extract tags' names
- `html_attrs` - Extract all of each tag's attributes
- `html_attr` - Extract tags' attribute value by name

---

## SelectorGadget

.pull-left-narrow[
- Open source tool that eases CSS selector generation and discovery
- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)
- Find out more on the [SelectorGadget vignette](https://rvest.tidyverse.org/articles/selectorgadget.html)
]

.pull-right-wide[
<img src="img/selector-gadget/selector-gadget.png" width="75%" style="display: block; margin: auto;" />
]

---

## Using the SelectorGadget

<img src="img/selector-gadget/selector-gadget.gif" width="80%" style="display: block; margin: auto;" />

---

<img src="img/selector-gadget/selector-gadget-1.png" width="95%" style="display: block; margin: auto;" />

---

<img src="img/selector-gadget/selector-gadget-2.png" width="95%" style="display: block; margin: auto;" />

---

<img src="img/selector-gadget/selector-gadget-3.png" width="95%" style="display: block; margin: auto;" />

---

<img src="img/selector-gadget/selector-gadget-4.png" width="95%" style="display: block; margin: auto;" />

---

<img src="img/selector-gadget/selector-gadget-5.png" width="95%" style="display: block; margin: auto;" />

---

## Using the SelectorGadget

Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.

<img src="img/selector-gadget/selector-gadget.gif" width="65%" style="display: block; margin: auto;" />

---

## Acknowledgements

* This course builds on materials from [Data Science in a Box](https://datasciencebox.org/), developed by Mine Çetinkaya-Rundel, which are adapted under the [Creative Commons Attribution Share Alike 4.0 International](https://github.com/rstudio-education/datascience-box/blob/master/LICENSE.md) license.
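
---

## Core rvest functions: a sketch

A minimal sketch, not part of the original materials, of how the core rvest functions listed earlier might be applied to the small HTML document from the Hypertext Markup Language slide. It assumes the **rvest** package is installed; the variable names `page` and `para` are illustrative only.

```r
library(rvest)

# Parse HTML supplied as a character string (read_html also accepts a URL)
page <- read_html('<html>
  <head><title>This is a title</title></head>
  <body><p align="center">Hello world!</p></body>
</html>')

# Select the first <p> element with a CSS selector
para <- page %>% html_element("p")

para %>% html_name()          # tag name: "p"
para %>% html_text()          # text content: "Hello world!"
para %>% html_attr("align")   # one attribute by name: "center"
para %>% html_attrs()         # all attributes as a named character vector

# html_elements() returns every match rather than only the first
page %>% html_elements("title") %>% html_text()
```

On a real page, the string passed to `html_element()` or `html_elements()` would typically be the CSS selector suggested by SelectorGadget.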