class: center, middle, inverse, title-slide .title[ # Visualizing Missing Data ] .subtitle[ ##
College of the Atlantic ] .author[ ### Laurie Baker ] --- class: middle # Visualizing Missing Data --- ## Origins of Missing Data - What are the origins of missing data? --- ## Packages for visualizing missing data ```r library(visdat) library(naniar) library(UpSetR) ``` --- ## Inspecting the data types Let's explore a randomly generated data set of imaginary people. ```r glimpse(typical_data) ``` ``` ## Rows: 5,000 ## Columns: 9 ## $ ID <chr> "0001", "0002", "0003", "0004", "0005", "0… ## $ Race <fct> Black, Black, Black, Hispanic, NA, White, … ## $ Age <dbl> NA, 25, 31, 27, 21, 22, 23, 21, NA, 27, 22… ## $ Sex <fct> Male, Male, Female, Female, Female, Female… ## $ `Height(cm)` <dbl> 175.9, 171.7, 173.5, 172.4, 158.5, 169.5, … ## $ IQ <dbl> 110, 84, 115, 84, 116, 83, 101, 97, 92, 99… ## $ Smokes <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F… ## $ Income <dbl> 91, 877, 3190, 4239, 3995, 557, 273, 1133,… ## $ Died <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F… ``` * Which of the variables' data types have likely been incorrectly identified? .footnote[For more info about the randomly generated data, run `?typical_data`] --- ## Summarizing Missing Data ```r summary(typical_data) ``` ``` ## ID Race Age ## Length:5000 White :2938 Min. :20.00 ## Class :character Hispanic : 745 1st Qu.:23.00 ## Mode :character Black : 525 Median :27.00 ## Asian : 181 Mean :27.47 ## Bi-Racial: 74 3rd Qu.:31.25 ## (Other) : 37 Max. :35.00 ## NA's : 500 NA's :500 ## Sex Height(cm) IQ Smokes ## Male :2499 Min. :140.3 Min. : 60.0 Mode :logical ## Female:2501 1st Qu.:168.8 1st Qu.: 93.0 FALSE:4085 ## Median :175.2 Median :100.0 TRUE :915 ## Mean :175.2 Mean :100.1 ## 3rd Qu.:181.8 3rd Qu.:107.0 ## Max. :208.9 Max. :137.0 ## NA's :500 ## Income Died ## Min. : 1 Mode :logical ## 1st Qu.:1126 FALSE:2474 ## Median :2250 TRUE :2526 ## Mean :2250 ## 3rd Qu.:3374 ## Max.
:4499 ## NA's :500 ``` * `summary()` gives a first hint of where data are missing: look for the `NA's` counts --- ## Visualizing Data Types ```r vis_dat(typical_data) ``` <img src="u5-d09-viz-miss_files/figure-html/vis data types-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Visualizing Missing Data ```r vis_miss(typical_data) ``` <img src="u5-d09-viz-miss_files/figure-html/vis missing data-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Calculating how much is missing ```r # From the naniar package: n_miss(typical_data) # number missing ``` ``` ## [1] 2000 ``` ```r n_complete(typical_data) # number complete ``` ``` ## [1] 43000 ``` ```r pct_miss(typical_data) # percent missing ``` ``` ## [1] 4.444444 ``` --- # Calculating how much is missing by column ```r miss_var_summary(typical_data) ``` ``` ## # A tibble: 9 × 3 ## variable n_miss pct_miss ## <chr> <int> <dbl> ## 1 Race 500 10 ## 2 Age 500 10 ## 3 IQ 500 10 ## 4 Income 500 10 ## 5 ID 0 0 ## 6 Sex 0 0 ## # ℹ 3 more rows ``` --- # Visualizing missing data by variable ```r gg_miss_var(typical_data) ``` <img src="u5-d09-viz-miss_files/figure-html/plot missing data by variable-1.png" width="60%" style="display: block; margin: auto;" /> --- # Visualizing percent missing data by variable ```r gg_miss_var(typical_data, show_pct = TRUE) + ylim(0, 100) ``` <img src="u5-d09-viz-miss_files/figure-html/plot percent missing-1.png" width="60%" style="display: block; margin: auto;" /> --- # Visualizing missing data by another variable ```r gg_miss_var(typical_data, facet = Race) ``` <img src="u5-d09-viz-miss_files/figure-html/plot missing by facet-1.png" width="60%" style="display: block; margin: auto;" /> --- # Missingness across factors ```r gg_miss_fct(x = typical_data, fct = Race) + labs(y = "Variable") ``` <img src="u5-d09-viz-miss_files/figure-html/plot missing percent heatmap-1.png" width="60%" style="display: block; margin: auto;" /> --- # Missingness across cases ```r gg_miss_case(typical_data) ``` <img
src="u5-d09-viz-miss_files/figure-html/plot missing cases-1.png" width="60%" style="display: block; margin: auto;" /> --- # Missingness across cases ```r typical_data %>% as_shadow_upset() %>% upset() ``` <img src="u5-d09-viz-miss_files/figure-html/understanding missing case frequency-1.png" width="60%" style="display: block; margin: auto;" /> --- # Daily Air Quality Data in NYC, May-Sept 1973 ```r vis_dat(airquality) ``` <img src="u5-d09-viz-miss_files/figure-html/visualizing data types of airquality-1.png" width="60%" style="display: block; margin: auto;" /> --- # Returning to the `ggplot` warnings ```r ggplot(airquality, aes(x = Solar.R, y = Ozone)) + geom_point() ``` ``` ## Warning: Removed 42 rows containing missing values ## (`geom_point()`). ``` <img src="u5-d09-viz-miss_files/figure-html/scatterplot of Solar radiation and Ozone-1.png" width="60%" style="display: block; margin: auto;" /> --- # Relationships in missing values ```r airquality %>% ggplot(aes(x = Ozone, y = Solar.R)) + geom_miss_point() ``` <img src="u5-d09-viz-miss_files/figure-html/plot of missing points-1.png" width="60%" style="display: block; margin: auto;" /> --- # Relationships in missing values ```r airquality %>% ggplot(aes(x = Ozone, y = Solar.R)) + geom_miss_point() + facet_wrap(~Month) ``` <img src="u5-d09-viz-miss_files/figure-html/plot missing relationships with facets-1.png" width="60%" style="display: block; margin: auto;" /> --- # Visualizing relationships in missing variables ```r airquality %>% bind_shadow() %>% ggplot(aes(x = Solar.R, fill = Ozone_NA)) + geom_histogram(binwidth = 20) ``` ``` ## Warning: Removed 7 rows containing non-finite values ## (`stat_bin()`). ``` <img src="u5-d09-viz-miss_files/figure-html/visualizing relationships in missing variables-1.png" width="60%" style="display: block; margin: auto;" /> --- # Consequences of Missing Data - So you have data that is missing? What now? 
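One way to start on that question is to measure what is lost when incomplete rows are simply dropped, which is what `geom_point()` did silently in the earlier scatterplot. The sketch below is an illustration, not a recommended workflow. It assumes only base R and the built-in `airquality` data, and the object names (`plot_vars`, `n_total`, `n_complete_rows`, `n_dropped`) are made up for this example.

```r
# Illustrative sketch: how many rows survive if we keep only
# complete cases for the two variables plotted earlier?
plot_vars <- airquality[, c("Ozone", "Solar.R")]

n_total <- nrow(plot_vars)
n_complete_rows <- sum(complete.cases(plot_vars))
n_dropped <- n_total - n_complete_rows

n_dropped  # should match the row count in the ggplot warning

# Dropping rows can also shift summaries: compare Solar.R overall
# with Solar.R among the rows where Ozone was actually recorded.
mean(airquality$Solar.R, na.rm = TRUE)
mean(airquality$Solar.R[!is.na(airquality$Ozone)], na.rm = TRUE)
```

If the two means differ noticeably, that suggests missingness in `Ozone` is related to `Solar.R`, so silently discarding rows may bias later results.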
--- # Your Turn <img src="img/Sasquatch.svg" alt="Outline of Sasquatch" width="60%" style="display: block; margin: auto;" /> Your Turn: <a href="https://commons.wikimedia.org/wiki/File:Sasquatch.svg">Happybluemo</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons --- # Thank you