class: center, middle, inverse, title-slide # Fitting and interpreting models ##
College of the Atlantic --- class: middle # Models with numerical explanatory variables --- ## Data: Paris Paintings ```r pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) ``` - Number of observations: 3393 - Number of variables: 61 --- ## Goal: Predict height from width `$$\widehat{height}_{i} = \beta_0 + \beta_1 \times width_{i}$$` <img src="u4-d02-fitting-interpreting-models_files/figure-html/height-width-plot-1.png" width="60%" style="display: block; margin: auto;" /> --- <img src="img/tidymodels.png" width="98%" style="display: block; margin: auto;" /> --- ## Step 1: Specify model ```r linear_reg() ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- ## Step 2: Set model fitting *engine* ```r linear_reg() %>% set_engine("lm") # lm: linear model ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- ## Step 3: Fit model & estimate parameters ... using **formula syntax** ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) ``` ``` ## parsnip model object ## ## ## Call: ## stats::lm(formula = Height_in ~ Width_in, data = data) ## ## Coefficients: ## (Intercept) Width_in ## 3.6214 0.7808 ``` --- ## A closer look at model output ``` ## parsnip model object ## ## ## Call: ## stats::lm(formula = Height_in ~ Width_in, data = data) ## ## Coefficients: ## (Intercept) Width_in ## 3.6214 0.7808 ``` .large[ `$$\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}$$` ] --- ## A tidy look at model output ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0 ``` .large[ `$$\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}$$` ] --- ## Slope and intercept .large[ `$$\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}$$` ] -- - **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.781 inches. -- - **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average. (Does this make sense?) --- ## Correlation does not imply causation Remember this when interpreting model coefficients <img src="img/cell_phones.png" width="90%" style="display: block; margin: auto;" /> .footnote[ Source: XKCD, [Cell phones](https://xkcd.com/925/) ] --- class: middle # Parameter estimation --- ## Linear model with a single predictor - We're interested in `\(\beta_0\)` (population parameter for the intercept) and `\(\beta_1\)` (population parameter for the slope) in the following model: `$$\hat{y}_{i} = \beta_0 + \beta_1~x_{i}$$` -- - Tough luck, you can't have them... -- - So we use sample statistics to estimate them: `$$\hat{y}_{i} = b_0 + b_1~x_{i}$$` --- ## Least squares regression - The regression line minimizes the sum of squared residuals. -- - If `\(e_i = y_i - \hat{y}_i\)`, then, the regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`. --- ## Visualizing residuals <img src="u4-d02-fitting-interpreting-models_files/figure-html/vis-res-1-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="u4-d02-fitting-interpreting-models_files/figure-html/vis-res-2-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="u4-d02-fitting-interpreting-models_files/figure-html/vis-res-3-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Properties of least squares regression - The regression line goes through the center of mass point, the coordinates corresponding to average `\(x\)` and average `\(y\)`, `\((\bar{x}, \bar{y})\)`: `$$\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}$$` -- - The slope has the same sign as the correlation coefficient: `\(b_1 = r \frac{s_y}{s_x}\)` -- - The sum of the residuals is zero: `\(\sum_{i = 1}^n e_i = 0\)` -- - The residuals and `\(x\)` values are uncorrelated --- class: middle # Model checking --- ## Data: Paris Paintings ```r pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) ``` - Number of observations: 3393 - Number of variables: 61 --- ## "Linear" models - We're fitting a "linear" model, which assumes a linear relationship between our explanatory and response variables. - But how do we assess this? --- ## Graphical diagnostic: residuals plot .panelset[ .panel[.panel-name[Plot] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ht_wt_fit <- linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ Width_in, data = pp) *ht_wt_fit_aug <- augment(ht_wt_fit$fit) ggplot(ht_wt_fit_aug, mapping = aes(x = .fitted, y = .resid)) + geom_point(alpha = 0.5) + geom_hline(yintercept = 0, color = "gray", lty = "dashed") + labs(x = "Predicted height", y = "Residuals") ``` ] .panel[.panel-name[Augment] ```r ht_wt_fit_aug ``` ``` ## # A tibble: 3,135 x 9 ## .rown~1 Heigh~2 Width~3 .fitted .resid .hat .sigma .cooksd ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 29.5 26.7 10.3 3.99e-4 8.30 3.10e-4 ## 2 2 18 14 14.6 3.45 3.96e-4 8.31 3.42e-5 ## 3 3 13 16 16.1 -3.11 3.61e-4 8.31 2.54e-5 ## 4 4 14 18 17.7 -3.68 3.37e-4 8.31 3.30e-5 ## 5 5 14 18 17.7 -3.68 3.37e-4 8.31 3.30e-5 ## 6 6 7 10 11.4 -4.43 4.98e-4 8.31 7.09e-5 ## 7 7 6 13 13.8 -7.77 4.18e-4 8.30 1.83e-4 ## 8 8 6 13 13.8 -7.77 4.18e-4 8.30 1.83e-4 ## 9 9 15 15 15.3 -0.333 3.77e-4 8.31 3.04e-7 ## 10 10 9 7 9.09 -0.0870 6.01e-4 8.31 3.30e-8 ## # ... with 3,125 more rows, 1 more variable: .std.resid <dbl>, ## # and abbreviated variable names 1: .rownames, 2: Height_in, ## # 3: Width_in ``` ] ] --- ## More on `augment()` ```r glimpse(ht_wt_fit_aug) ``` ``` ## Rows: 3,135 ## Columns: 9 ## $ .rownames <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9",~ ## $ Height_in <dbl> 37, 18, 13, 14, 14, 7, 6, 6, 15, 9, 9, 16, 1~ ## $ Width_in <dbl> 29.5, 14.0, 16.0, 18.0, 18.0, 10.0, 13.0, 13~ ## $ .fitted <dbl> 26.65490, 14.55256, 16.11415, 17.67574, 17.6~ ## $ .resid <dbl> 10.3451004, 3.4474447, -3.1141481, -3.675740~ ## $ .hat <dbl> 0.0003991488, 0.0003961825, 0.0003611963, 0.~ ## $ .sigma <dbl> 8.303538, 8.305367, 8.305409, 8.305336, 8.30~ ## $ .cooksd <dbl> 3.099689e-04, 3.416655e-05, 2.541574e-05, 3.~ ## $ .std.resid <dbl> 1.24600543, 0.41522347, -0.37507338, -0.4427~ ``` --- ## Looking for... - Residuals distributed randomly around 0 - With no visible pattern along the x or y axes <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Not looking for... .large[ **Fan shapes** ] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Not looking for... .large[ **Groups of patterns** ] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Not looking for... .large[ **Residuals correlated with predicted values** ] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Not looking for... .large[ **Any patterns!** ] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" /> --- .question[ What patterns does the residuals plot reveal that should make us question whether the current model is a good fit for modeling the height of the paintings? ] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Models with categorical explanatory variables --- ## Categorical predictor with 2 levels .pull-left-narrow[ .small[ ``` ## # A tibble: 3,393 x 3 ## name Height_in landsALL ## <chr> <dbl> <dbl> ## 1 L1764-2 37 0 ## 2 L1764-3 18 0 ## 3 L1764-4 13 1 ## 4 L1764-5a 14 1 ## 5 L1764-5b 14 1 ## 6 L1764-6 7 0 ## 7 L1764-7a 6 0 ## 8 L1764-7b 6 0 ## 9 L1764-8 15 0 ## 10 L1764-9a 9 0 ## 11 L1764-9b 9 0 ## 12 L1764-10a 16 1 ## 13 L1764-10b 16 1 ## 14 L1764-10c 16 1 ## 15 L1764-11 20 0 ## 16 L1764-12a 14 1 ## 17 L1764-12b 14 1 ## 18 L1764-13a 15 1 ## 19 L1764-13b 15 1 ## 20 L1764-14 37 0 ## # ... with 3,373 more rows ``` ] ] .pull-right-wide[ - `landsALL = 0`: No landscape features - `landsALL = 1`: Some landscape features ] --- ## Height & landscape features ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ factor(landsALL), data = pp) %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 22.7 0.328 69.1 0 ## 2 factor(landsALL)1 -5.65 0.532 -10.6 7.97e-26 ``` --- ## Height & landscape features `$$\widehat{Height_{in}} = 22.7 - 5.645~landsALL$$` - **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.7 inches tall - **Landscapes** Paintings with landscape features are expected, on average, to be 5.645 inches shorter than paintings that without landscape features - Compares baseline level (`landsALL = 0`) to the other level (`landsALL = 1`) --- ## Graphical diagnostic: residuals plot .panelset[ .panel[.panel-name[Plot] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ht_land_fit <- linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ factor(landsALL), data = pp) *ht_land_fit_aug <- augment(ht_land_fit$fit) ggplot(ht_land_fit_aug, mapping = aes(x = .fitted, y = .resid)) + geom_point(alpha = 0.5) + geom_hline(yintercept = 0, color = "gray", lty = "dashed") + labs(x = "Predicted height", y = "Residuals") ``` ] .panel[.panel-name[Augment] ```r ht_land_fit_aug ``` ``` ## # A tibble: 3,141 x 9 ## .rowna~1 Heigh~2 facto~3 .fitted .resid .hat .sigma .cooksd ## <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 0 22.7 14.3 5.13e-4 14.5 2.51e-4 ## 2 2 18 0 22.7 -4.68 5.13e-4 14.5 2.68e-5 ## 3 3 13 1 17.0 -4.03 8.39e-4 14.5 3.26e-5 ## 4 4 14 1 17.0 -3.03 8.39e-4 14.5 1.85e-5 ## 5 5 14 1 17.0 -3.03 8.39e-4 14.5 1.85e-5 ## 6 6 7 0 22.7 -15.7 5.13e-4 14.5 3.01e-4 ## 7 7 6 0 22.7 -16.7 5.13e-4 14.5 3.41e-4 ## 8 8 6 0 22.7 -16.7 5.13e-4 14.5 3.41e-4 ## 9 9 15 0 22.7 -7.68 5.13e-4 14.5 7.22e-5 ## 10 10 9 0 22.7 -13.7 5.13e-4 14.5 2.29e-4 ## # ... with 3,131 more rows, 1 more variable: .std.resid <dbl>, ## # and abbreviated variable names 1: .rownames, 2: Height_in, ## # 3: `factor(landsALL)` ``` ] ] --- ## Height, Width & landscape features ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ factor(landsALL) + Width_in, data = pp) %>% tidy() ``` ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 5.62 0.269 20.9 1.24e-90 ## 2 factor(landsALL)1 -5.02 0.292 -17.2 3.70e-63 ## 3 Width_in 0.777 0.00909 85.4 0 ``` --- ## Height & landscape features `$$\widehat{Height_{in}} = 5.615 - 5.017~landsALL + 0.777 \times width_{i}$$` - **Intercept:** Paintings that don't have landscape features are expected, on average, to be 5.615 inches tall - **Landscapes** Paintings with landscape features are expected, on average, to be 5.017 inches shorter than paintings that without landscape features - Compares baseline level (`landsALL = 0`) to the other level (`landsALL = 1`) - **Slope:** For every inch in width, paintings are expected to be 0.777 inches taller. --- ## Graphical diagnostic: residuals plot .panelset[ .panel[.panel-name[Plot] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ht_wt_land_fit <- linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ factor(landsALL) + Width_in, data = pp) *ht_wt_land_fit_aug <- augment(ht_wt_land_fit$fit) ggplot(ht_wt_land_fit_aug, mapping = aes(x = .fitted, y = .resid)) + geom_point(alpha = 0.5) + geom_hline(yintercept = 0, color = "gray", lty = "dashed") + labs(x = "Predicted height", y = "Residuals") ``` ] .panel[.panel-name[Augment] ```r ht_wt_land_fit_aug ``` ``` ## # A tibble: 3,135 x 10 ## .rown~1 Heigh~2 facto~3 Width~4 .fitted .resid .hat .sigma ## <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 0 29.5 28.5 8.47 5.88e-4 7.94 ## 2 2 18 0 14 16.5 1.51 5.98e-4 7.94 ## 3 3 13 1 16 13.0 -0.0251 8.75e-4 7.94 ## 4 4 14 1 18 14.6 -0.578 8.53e-4 7.94 ## 5 5 14 1 18 14.6 -0.578 8.53e-4 7.94 ## 6 6 7 0 10 13.4 -6.38 7.03e-4 7.94 ## 7 7 6 0 13 15.7 -9.71 6.20e-4 7.94 ## 8 8 6 0 13 15.7 -9.71 6.20e-4 7.94 ## 9 9 15 0 15 17.3 -2.27 5.78e-4 7.94 ## 10 10 9 0 7 11.1 -2.05 8.09e-4 7.94 ## # ... with 3,125 more rows, 2 more variables: .cooksd <dbl>, ## # .std.resid <dbl>, and abbreviated variable names ## # 1: .rownames, 2: Height_in, 3: `factor(landsALL)`, ## # 4: Width_in ``` ] ] --- ## Logging Height, Graphical diagnostic: residuals plot .panelset[ .panel[.panel-name[Plot] <img src="u4-d02-fitting-interpreting-models_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ht_wt_land_fit <- linear_reg() %>% set_engine("lm") %>% fit(log(Height_in) ~ factor(landsALL) + Width_in, data = pp) *ht_wt_land_fit_aug <- augment(ht_wt_land_fit$fit) ggplot(ht_wt_land_fit_aug, mapping = aes(x = .fitted, y = .resid)) + geom_point(alpha = 0.5) + geom_hline(yintercept = 0, color = "gray", lty = "dashed") + labs(x = "Predicted logged height", y = "Residuals") ``` ] .panel[.panel-name[Augment] ```r ht_wt_land_fit_aug ``` ``` ## # A tibble: 3,135 x 10 ## .rown~1 log(H~2 facto~3 Width~4 .fitted .resid .hat .sigma ## <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 3.61 0 29.5 3.15 0.456 5.88e-4 0.400 ## 2 2 2.89 0 14 2.64 0.248 5.98e-4 0.400 ## 3 3 2.56 1 16 2.47 0.0966 8.75e-4 0.400 ## 4 4 2.64 1 18 2.53 0.105 8.53e-4 0.400 ## 5 5 2.64 1 18 2.53 0.105 8.53e-4 0.400 ## 6 6 1.95 0 10 2.51 -0.564 7.03e-4 0.400 ## 7 7 1.79 0 13 2.61 -0.818 6.20e-4 0.400 ## 8 8 1.79 0 13 2.61 -0.818 6.20e-4 0.400 ## 9 9 2.71 0 15 2.68 0.0326 5.78e-4 0.400 ## 10 10 2.20 0 7 2.41 -0.214 8.09e-4 0.400 ## # ... with 3,125 more rows, 2 more variables: .cooksd <dbl>, ## # .std.resid <dbl>, and abbreviated variable names ## # 1: .rownames, 2: `log(Height_in)`, 3: `factor(landsALL)`, ## # 4: Width_in ``` ] ] --- ## Residuals Summary - Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable. -- - The most common transformation when the response variable is right skewed is the log transform: `\(log(y)\)`, especially useful when the response variable is (extremely) right skewed. -- - This transformation is also useful for variance stabilization. -- - When using a log transformation on the response variable the interpretation of the slope changes: *"For each unit increase in x, y is expected on average to be higher/lower <br> by a factor of `\(e^{b_1}\)`."* -- - Another useful transformation is the square root: `\(\sqrt{y}\)`, especially useful when the response variable is counts. --- class: inverse # Additional Slides --- ## Relationship between height and school ```r linear_reg() %>% set_engine("lm") %>% fit(Height_in ~ school_pntg, data = pp) %>% tidy() ``` ``` ## # A tibble: 7 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 14.0 10.0 1.40 0.162 ## 2 school_pntgD/FL 2.33 10.0 0.232 0.816 ## 3 school_pntgF 10.2 10.0 1.02 0.309 ## 4 school_pntgG 1.65 11.9 0.139 0.889 ## 5 school_pntgI 10.3 10.0 1.02 0.306 ## 6 school_pntgS 30.4 11.4 2.68 0.00744 ## 7 school_pntgX 2.87 10.3 0.279 0.780 ``` --- ## Dummy variables ``` ## # A tibble: 7 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 14.0 10.0 1.40 0.162 ## 2 school_pntgD/FL 2.33 10.0 0.232 0.816 ## 3 school_pntgF 10.2 10.0 1.02 0.309 ## 4 school_pntgG 1.65 11.9 0.139 0.889 ## 5 school_pntgI 10.3 10.0 1.02 0.306 ## 6 school_pntgS 30.4 11.4 2.68 0.00744 ## 7 school_pntgX 2.87 10.3 0.279 0.780 ``` - When the categorical explanatory variable has many levels, they're encoded to **dummy variables** - Each coefficient describes the expected difference between heights in that particular school compared to the baseline level --- ## Categorical predictor with 3+ levels .pull-left-wide[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> school_pntg </th> <th style="text-align:center;"> D_FL </th> <th style="text-align:center;"> F </th> <th style="text-align:center;"> G </th> <th style="text-align:center;"> I </th> <th style="text-align:center;"> S </th> <th style="text-align:center;"> X </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> D/FL </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> F </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> G </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> I </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> S </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> X </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(68, 1, 84, 1) !important;"> 0 </td> <td style="text-align:center;width: 10em; color: white !important;background-color: rgba(122, 209, 81, 1) !important;"> 1 </td> </tr> </tbody> </table> ] .pull-right-narrow[ .small[ ``` ## # A tibble: 3,393 x 3 ## name Height_in school_pntg ## <chr> <dbl> <chr> ## 1 L1764-2 37 F ## 2 L1764-3 18 I ## 3 L1764-4 13 D/FL ## 4 L1764-5a 14 F ## 5 L1764-5b 14 F ## 6 L1764-6 7 I ## 7 L1764-7a 6 F ## 8 L1764-7b 6 F ## 9 L1764-8 15 I ## 10 L1764-9a 9 D/FL ## 11 L1764-9b 9 D/FL ## 12 L1764-10a 16 X ## 13 L1764-10b 16 X ## 14 L1764-10c 16 X ## 15 L1764-11 20 D/FL ## 16 L1764-12a 14 D/FL ## 17 L1764-12b 14 D/FL ## 18 L1764-13a 15 D/FL ## 19 L1764-13b 15 D/FL ## 20 L1764-14 37 F ## # ... with 3,373 more rows ``` ] ] --- ## Relationship between height and school .small[ ``` ## # A tibble: 7 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 14.0 10.0 1.40 0.162 ## 2 school_pntgD/FL 2.33 10.0 0.232 0.816 ## 3 school_pntgF 10.2 10.0 1.02 0.309 ## 4 school_pntgG 1.65 11.9 0.139 0.889 ## 5 school_pntgI 10.3 10.0 1.02 0.306 ## 6 school_pntgS 30.4 11.4 2.68 0.00744 ## 7 school_pntgX 2.87 10.3 0.279 0.780 ``` - **Austrian school (A)** paintings are expected, on average, to be **14 inches** tall. - **Dutch/Flemish school (D/FL)** paintings are expected, on average, to be **2.33 inches taller** than *Austrian school* paintings. - **French school (F)** paintings are expected, on average, to be **10.2 inches taller** than *Austrian school* paintings. - **German school (G)** paintings are expected, on average, to be **1.65 inches taller** than *Austrian school* paintings. - **Italian school (I)** paintings are expected, on average, to be **10.3 inches taller** than *Austrian school* paintings. - **Spanish school (S)** paintings are expected, on average, to be **30.4 inches taller** than *Austrian school* paintings. - Paintings whose school is **unknown (X)** are expected, on average, to be **2.87 inches taller** than *Austrian school* paintings. ] --- ## Acknowledgements * This course builds on the materials from [Data Science in a Box](https://datasciencebox.org/) developed by Mine Çetinkaya-Rundel and are adapted under the [Creative Commons Attribution Share Alike 4.0 International](https://github.com/rstudio-education/datascience-box/blob/master/LICENSE.md)