class: center, middle, inverse, title-slide

# Scraping top 250 movies on IMDB

##
College of the Atlantic

---

class: middle

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:
<br>
http://www.imdb.com/chart/top

.pull-left[
<img src="img/imdb-top-250.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/imdb-top-250-source.png" width="94%" style="display: block; margin: auto;" />
]

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## [1] FALSE
```

---

## Plan

<img src="img/plan.png" width="90%" style="display: block; margin: auto;" />

---

## Plan

1. Read the whole page
2. Scrape movie titles and save as `titles`
3. Scrape years movies were made in and save as `years`
4. Scrape IMDB ratings and save as `ratings`
5. Create a data frame called `imdb_top_250` with variables `title`, `year`, and `rating`

---

class: middle

# Step 1. Read the whole page

---

## Read the whole page

```r
library(tidyverse) # pipe, stringr, tibble, dplyr, ggplot2
library(rvest)     # read_html(), html_nodes(), html_text()
page <- read_html("https://www.imdb.com/chart/top/")
page
```

```
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body id="styleguide-v2" class="fixed">\n <img ...
```

---

## A webpage in R

- Result is a list with 2 elements

```r
typeof(page)
```

```
## [1] "list"
```

--

- that we need to convert to something more familiar, like a data frame...

```r
class(page)
```

```
## [1] "xml_document" "xml_node"
```

---

class: middle

# Step 2. Scrape movie titles and save as `titles`

---

## Scrape movie titles

<img src="img/titles.png" width="70%" style="display: block; margin: auto;" />

---

## Scrape the nodes

.pull-left[
```r
page %>%
  html_nodes(".titleColumn a")
```

```
## {xml_nodeset (250)}
##  [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [3] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [4] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [9] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [10] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [11] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [12] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [13] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [14] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[
```r
page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
```

```
##   [1] "The Shawshank Redemption"
##   [2] "The Godfather"
##   [3] "The Dark Knight"
##   [4] "The Godfather Part II"
##   [5] "12 Angry Men"
##   [6] "Schindler's List"
##   [7] "The Lord of the Rings: The Return of the King"
##   [8] "Pulp Fiction"
##   [9] "The Lord of the Rings: The Fellowship of the Ring"
##  [10] "The Good, the Bad and the Ugly"
##  [11] "Forrest Gump"
##  [12] "Fight Club"
##  [13] "The Lord of the Rings: The Two Towers"
##  [14] "Inception"
##  [15] "Star Wars: Episode V - The Empire Strikes Back"
##  [16] "The Matrix"
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `titles`

.pull-left[
```r
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

titles
```

```
##   [1] "The Shawshank Redemption"
##   [2] "The Godfather"
##   [3] "The Dark Knight"
##   [4] "The Godfather Part II"
##   [5] "12 Angry Men"
##   [6] "Schindler's List"
##   [7] "The Lord of the Rings: The Return of the King"
##   [8] "Pulp Fiction"
##   [9] "The Lord of the Rings: The Fellowship of the Ring"
##  [10] "The Good, the Bad and the Ugly"
##  [11] "Forrest Gump"
##  [12] "Fight Club"
##  [13] "The Lord of the Rings: The Two Towers"
##  [14] "Inception"
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

class: middle

# Step 3. Scrape years movies were made in and save as `years`

---

## Scrape years movies were made in

<img src="img/years.png" width="70%" style="display: block; margin: auto;" />

---

## Scrape the nodes

.pull-left[
```r
page %>%
  html_nodes(".secondaryInfo")
```

```
## {xml_nodeset (250)}
##  [1] <span class="secondaryInfo">(1994)</span>
##  [2] <span class="secondaryInfo">(1972)</span>
##  [3] <span class="secondaryInfo">(2008)</span>
##  [4] <span class="secondaryInfo">(1974)</span>
##  [5] <span class="secondaryInfo">(1957)</span>
##  [6] <span class="secondaryInfo">(1993)</span>
##  [7] <span class="secondaryInfo">(2003)</span>
##  [8] <span class="secondaryInfo">(1994)</span>
##  [9] <span class="secondaryInfo">(2001)</span>
## [10] <span class="secondaryInfo">(1966)</span>
## [11] <span class="secondaryInfo">(1994)</span>
## [12] <span class="secondaryInfo">(1999)</span>
## [13] <span class="secondaryInfo">(2002)</span>
## [14] <span class="secondaryInfo">(2010)</span>
## [15] <span class="secondaryInfo">(1980)</span>
## [16] <span class="secondaryInfo">(1999)</span>
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[
```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
```

```
##   [1] "(1994)" "(1972)" "(2008)" "(1974)" "(1957)" "(1993)"
##   [7] "(2003)" "(1994)" "(2001)" "(1966)" "(1994)" "(1999)"
##  [13] "(2002)" "(2010)" "(1980)" "(1999)" "(1990)" "(1975)"
##  [19] "(1995)" "(1946)" "(1954)" "(1991)" "(1998)" "(2002)"
##  [25] "(2014)" "(1997)" "(1999)" "(1977)" "(1991)" "(1985)"
##  [31] "(2001)" "(2002)" "(1960)" "(2019)" "(1994)" "(1994)"
##  [37] "(2000)" "(1998)" "(2006)" "(2014)" "(2006)" "(1995)"
##  [43] "(1942)" "(1988)" "(1962)" "(2011)" "(1936)" "(1968)"
##  [49] "(1988)" "(1954)" "(1979)" "(1931)" "(1979)" "(2000)"
##  [55] "(2012)" "(1981)" "(2008)" "(2006)" "(1950)" "(1957)"
##  [61] "(1980)" "(1940)" "(2018)" "(1957)" "(1986)" "(2018)"
##  [67] "(1999)" "(1964)" "(2012)" "(2003)" "(2009)" "(1984)"
##  [73] "(2017)" "(1995)" "(1995)" "(2019)" "(1981)" "(2019)"
##  [79] "(1997)" "(1984)" "(1997)" "(2016)" "(1952)" "(2009)"
##  [85] "(2000)" "(1963)" "(2010)" "(2018)" "(1983)" "(1968)"
##  [91] "(2004)" "(1985)" "(1992)" "(2012)" "(1941)" "(1962)"
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Clean up the text

We need to go from `"(1994)"` to `1994`:

- Remove `(` and `)`: string manipulation
- Convert to numeric: `as.numeric()`

---

## stringr

.pull-left-wide[
- **stringr** provides a cohesive set of functions designed to make working with strings as easy as possible
- Functions in stringr start with `str_*()`, e.g.
  - `str_remove()` to remove a pattern from a string

```r
str_remove(string = "jello", pattern = "el")
```

```
## [1] "jlo"
```

  - `str_replace()` to replace a pattern with another

```r
str_replace(string = "jello", pattern = "j", replacement = "h")
```

```
## [1] "hello"
```
]
.pull-right-narrow[
<img src="img/stringr.png" width="100%" style="display: block; margin: auto auto auto 0;" />
]

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") # remove (
```

```
##   [1] "1994)" "1972)" "2008)" "1974)" "1957)" "1993)" "2003)"
##   [8] "1994)" "2001)" "1966)" "1994)" "1999)" "2002)" "2010)"
##  [15] "1980)" "1999)" "1990)" "1975)" "1995)" "1946)" "1954)"
##  [22] "1991)" "1998)" "2002)" "2014)" "1997)" "1999)" "1977)"
##  [29] "1991)" "1985)" "2001)" "2002)" "1960)" "2019)" "1994)"
##  [36] "1994)" "2000)" "1998)" "2006)" "2014)" "2006)" "1995)"
##  [43] "1942)" "1988)" "1962)" "2011)" "1936)" "1968)" "1988)"
##  [50] "1954)" "1979)" "1931)" "1979)" "2000)" "2012)" "1981)"
##  [57] "2008)" "2006)" "1950)" "1957)" "1980)" "1940)" "2018)"
##  [64] "1957)" "1986)" "2018)" "1999)" "1964)" "2012)" "2003)"
##  [71] "2009)" "1984)" "2017)" "1995)" "1995)" "2019)" "1981)"
##  [78] "2019)" "1997)" "1984)" "1997)" "2016)" "1952)" "2009)"
##  [85] "2000)" "1963)" "2010)" "2018)" "1983)" "1968)" "2004)"
##  [92] "1985)" "1992)" "2012)" "1941)" "1962)" "1931)" "1952)"
##  [99] "1959)" "1958)" "1960)" "2001)" "1944)" "1971)" "1987)"
...
```

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)")     # remove )
```

```
##   [1] "1994" "1972" "2008" "1974" "1957" "1993" "2003" "1994"
##   [9] "2001" "1966" "1994" "1999" "2002" "2010" "1980" "1999"
##  [17] "1990" "1975" "1995" "1946" "1954" "1991" "1998" "2002"
##  [25] "2014" "1997" "1999" "1977" "1991" "1985" "2001" "2002"
##  [33] "1960" "2019" "1994" "1994" "2000" "1998" "2006" "2014"
##  [41] "2006" "1995" "1942" "1988" "1962" "2011" "1936" "1968"
##  [49] "1988" "1954" "1979" "1931" "1979" "2000" "2012" "1981"
##  [57] "2008" "2006" "1950" "1957" "1980" "1940" "2018" "1957"
##  [65] "1986" "2018" "1999" "1964" "2012" "2003" "2009" "1984"
##  [73] "2017" "1995" "1995" "2019" "1981" "2019" "1997" "1984"
##  [81] "1997" "2016" "1952" "2009" "2000" "1963" "2010" "2018"
##  [89] "1983" "1968" "2004" "1985" "1992" "2012" "1941" "1962"
##  [97] "1931" "1952" "1959" "1958" "1960" "2001" "1944" "1971"
## [105] "1987" "1983" "2020" "2010" "1995" "1962" "2009" "1973"
...
```

---

## Convert to numeric

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") %>% # remove )
  as.numeric()
```

```
##   [1] 1994 1972 2008 1974 1957 1993 2003 1994 2001 1966 1994 1999
##  [13] 2002 2010 1980 1999 1990 1975 1995 1946 1954 1991 1998 2002
##  [25] 2014 1997 1999 1977 1991 1985 2001 2002 1960 2019 1994 1994
##  [37] 2000 1998 2006 2014 2006 1995 1942 1988 1962 2011 1936 1968
##  [49] 1988 1954 1979 1931 1979 2000 2012 1981 2008 2006 1950 1957
##  [61] 1980 1940 2018 1957 1986 2018 1999 1964 2012 2003 2009 1984
##  [73] 2017 1995 1995 2019 1981 2019 1997 1984 1997 2016 1952 2009
##  [85] 2000 1963 2010 2018 1983 1968 2004 1985 1992 2012 1941 1962
##  [97] 1931 1952 1959 1958 1960 2001 1944 1971 1987 1983 2020 2010
## [109] 1995 1962 2009 1973 2011 1927 1976 1997 1988 1989 2000 1948
## [121] 2007 2019 2022 2004 2016 1965 2005 1921 1959 2020 1950 2013
## [133] 2018 1961 1985 1995 1998 2007 2006 1992 1999 2010 2001 1993
## [145] 1961 1948 1975 2007 1963 2003 1950 1982 2021 2003 1980 1974
...
```

---

## Save as `years`

.pull-left[
```r
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") %>% # remove )
  as.numeric()

years
```

```
##   [1] 1994 1972 2008 1974 1957 1993 2003 1994 2001 1966 1994 1999
##  [13] 2002 2010 1980 1999 1990 1975 1995 1946 1954 1991 1998 2002
##  [25] 2014 1997 1999 1977 1991 1985 2001 2002 1960 2019 1994 1994
##  [37] 2000 1998 2006 2014 2006 1995 1942 1988 1962 2011 1936 1968
##  [49] 1988 1954 1979 1931 1979 2000 2012 1981 2008 2006 1950 1957
##  [61] 1980 1940 2018 1957 1986 2018 1999 1964 2012 2003 2009 1984
##  [73] 2017 1995 1995 2019 1981 2019 1997 1984 1997 2016 1952 2009
##  [85] 2000 1963 2010 2018 1983 1968 2004 1985 1992 2012 1941 1962
##  [97] 1931 1952 1959 1958 1960 2001 1944 1971 1987 1983 2020 2010
## [109] 1995 1962 2009 1973 2011 1927 1976 1997 1988 1989 2000 1948
## [121] 2007 2019 2022 2004 2016 1965 2005 1921 1959 2020 1950 2013
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

class: middle

# Step 4. Scrape IMDB ratings and save as `ratings`

---

## Scrape IMDB ratings

<img src="img/ratings.png" width="70%" style="display: block; margin: auto;" />

---

## Scrape the nodes

.pull-left[
```r
page %>%
  html_nodes("strong")
```

```
## {xml_nodeset (250)}
##  [1] <strong title="9.2 based on 2,725,518 user ratings">9.2</ ...
##  [2] <strong title="9.2 based on 1,894,721 user ratings">9.2</ ...
##  [3] <strong title="9.0 based on 2,698,393 user ratings">9.0</ ...
##  [4] <strong title="9.0 based on 1,292,145 user ratings">9.0</ ...
##  [5] <strong title="9.0 based on 805,990 user ratings">9.0</st ...
##  [6] <strong title="8.9 based on 1,376,396 user ratings">8.9</ ...
##  [7] <strong title="8.9 based on 1,875,004 user ratings">8.9</ ...
##  [8] <strong title="8.8 based on 2,093,728 user ratings">8.8</ ...
##  [9] <strong title="8.8 based on 1,904,130 user ratings">8.8</ ...
## [10] <strong title="8.8 based on 772,683 user ratings">8.8</st ...
## [11] <strong title="8.8 based on 2,119,848 user ratings">8.8</ ...
## [12] <strong title="8.7 based on 2,168,662 user ratings">8.7</ ...
## [13] <strong title="8.7 based on 1,692,758 user ratings">8.7</ ...
## [14] <strong title="8.7 based on 2,394,894 user ratings">8.7</ ...
## [15] <strong title="8.7 based on 1,312,085 user ratings">8.7</ ...
## [16] <strong title="8.7 based on 1,944,345 user ratings">8.7</ ...
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[
```r
page %>%
  html_nodes("strong") %>%
  html_text()
```

```
##   [1] "9.2" "9.2" "9.0" "9.0" "9.0" "8.9" "8.9" "8.8" "8.8" "8.8"
##  [11] "8.8" "8.7" "8.7" "8.7" "8.7" "8.7" "8.7" "8.6" "8.6" "8.6"
##  [21] "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5" "8.5"
##  [31] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"
##  [41] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.4" "8.4" "8.4" "8.4"
##  [51] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4"
##  [61] "8.4" "8.4" "8.4" "8.4" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
##  [71] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
##  [81] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
##  [91] "8.3" "8.3" "8.3" "8.3" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [101] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [111] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [121] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [131] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [141] "8.2" "8.2" "8.2" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [151] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Convert to numeric

.pull-left[
```r
page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
```

```
##   [1] 9.2 9.2 9.0 9.0 9.0 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7 8.7
##  [16] 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5
##  [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
##  [46] 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
##  [61] 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
##  [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
##  [91] 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [196] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [211] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `ratings`

.pull-left[
```r
ratings <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()

ratings
```

```
##   [1] 9.2 9.2 9.0 9.0 9.0 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7 8.7
##  [16] 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5
##  [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
##  [46] 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
##  [61] 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
##  [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
##  [91] 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

class: middle

# Step 5. Create a data frame called `imdb_top_250`

---

## Create a data frame: `imdb_top_250`

```r
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  rating = ratings
)

imdb_top_250
```

```
## # A tibble: 250 x 3
##   title                     year rating
##   <chr>                    <dbl>  <dbl>
## 1 The Shawshank Redemption  1994    9.2
## 2 The Godfather             1972    9.2
## 3 The Dark Knight           2008    9
## 4 The Godfather Part II     1974    9
## 5 12 Angry Men              1957    9
## 6 Schindler's List          1993    8.9
## # ... with 244 more rows
```

---
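
## Putting the steps together

Steps 2 to 5 can be wrapped in one function, which makes it easy to check that every selector matched the same number of nodes before building the data frame. This is only a sketch: `scrape_movies()` and `snippet` are hypothetical names, the two-row HTML is a made-up stand-in that reuses the class names from these slides (it is not IMDB's real markup), and `str_remove_all("[()]")` is swapped in for the two `str_remove()` calls to drop both parentheses in one pass.

```r
library(rvest)
library(stringr)
library(tibble)

# Hypothetical two-row snippet that reuses the class names from the slides;
# it stands in for the real page so the example runs offline.
snippet <- read_html('
  <table>
    <tr>
      <td class="titleColumn"><a href="#">The Shawshank Redemption</a>
        <span class="secondaryInfo">(1994)</span></td>
      <td><strong>9.2</strong></td>
    </tr>
    <tr>
      <td class="titleColumn"><a href="#">The Godfather</a>
        <span class="secondaryInfo">(1972)</span></td>
      <td><strong>9.2</strong></td>
    </tr>
  </table>')

# Hypothetical helper: runs steps 2-5 on a page that has already been read.
scrape_movies <- function(page) {
  titles <- page %>%
    html_nodes(".titleColumn a") %>%
    html_text()

  years <- page %>%
    html_nodes(".secondaryInfo") %>%
    html_text() %>%
    str_remove_all("[()]") %>% # remove ( and ) in one pass
    as.numeric()

  ratings <- page %>%
    html_nodes("strong") %>%
    html_text() %>%
    as.numeric()

  # Stop early if the selectors matched different numbers of nodes
  stopifnot(
    length(titles) == length(years),
    length(titles) == length(ratings)
  )

  tibble(title = titles, year = years, rating = ratings)
}

movies <- scrape_movies(snippet)
movies
```

With a page read from the live site, the same call would be `scrape_movies(page)`, assuming the selectors still match the page's markup.
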
---

## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Rows: 250
## Columns: 3
## $ title  <chr> "The Shawshank Redemption", "The Godfather", "Th~
## $ year   <dbl> 1994, 1972, 2008, 1974, 1957, 1993, 2003, 1994, ~
## $ rating <dbl> 9.2, 9.2, 9.0, 9.0, 9.0, 8.9, 8.9, 8.8, 8.8, 8.8~
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(rank = 1:nrow(imdb_top_250)) %>%
  relocate(rank)
```

---

```
## # A tibble: 250 x 4
##     rank title                                        year rating
##    <int> <chr>                                       <dbl>  <dbl>
##  1     1 The Shawshank Redemption                     1994    9.2
##  2     2 The Godfather                                1972    9.2
##  3     3 The Dark Knight                              2008    9
##  4     4 The Godfather Part II                        1974    9
##  5     5 12 Angry Men                                 1957    9
##  6     6 Schindler's List                             1993    8.9
##  7     7 The Lord of the Rings: The Return of the K~  2003    8.9
##  8     8 Pulp Fiction                                 1994    8.8
##  9     9 The Lord of the Rings: The Fellowship of t~  2001    8.8
## 10    10 The Good, the Bad and the Ugly               1966    8.8
## 11    11 Forrest Gump                                 1994    8.8
## 12    12 Fight Club                                   1999    8.7
## 13    13 The Lord of the Rings: The Two Towers        2002    8.7
## 14    14 Inception                                    2010    8.7
## 15    15 Star Wars: Episode V - The Empire Strikes ~  1980    8.7
## 16    16 The Matrix                                   1999    8.7
## 17    17 Goodfellas                                   1990    8.7
## 18    18 One Flew Over the Cuckoo's Nest              1975    8.6
## 19    19 Se7en                                        1995    8.6
## 20    20 It's a Wonderful Life                        1946    8.6
## # ... with 230 more rows
```

---

class: middle

# What next?

---

.question[
Which years have the most movies on the list?
]

--

```r
imdb_top_250 %>%
  count(year, sort = TRUE)
```

```
## # A tibble: 87 x 2
##    year     n
##   <dbl> <int>
## 1  1995     8
## 2  2004     7
## 3  1957     6
## 4  1999     6
## 5  2003     6
## 6  2009     6
## # ... with 81 more rows
```

---

.question[
Which 1995 movies made the list?
]

--

```r
imdb_top_250 %>%
  filter(year == 1995) %>%
  print(n = 8)
```

```
## # A tibble: 8 x 4
##    rank title               year rating
##   <int> <chr>              <dbl>  <dbl>
## 1    19 Se7en               1995    8.6
## 2    42 The Usual Suspects  1995    8.5
## 3    74 Toy Story           1995    8.3
## 4    75 Braveheart          1995    8.3
## 5   109 Heat                1995    8.2
## 6   136 Casino              1995    8.2
## 7   181 Before Sunrise      1995    8.1
## 8   228 La haine            1995    8
```

---

.question[
Visualize the average yearly rating for movies that made it on the top 250 list over time.
]

--

.panelset[
.panel[.panel-name[Plot]
<img src="u2-d19-top-250-imdb_files/figure-html/unnamed-chunk-46-1.png" width="58%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]
```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(rating)) %>%
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Year", y = "Average score")
```
]
]

---

## Acknowledgements

* This course builds on materials from [Data Science in a Box](https://datasciencebox.org/), developed by Mine Çetinkaya-Rundel, which are adapted under the [Creative Commons Attribution Share Alike 4.0 International](https://github.com/rstudio-education/datascience-box/blob/master/LICENSE.md) license.