class: center, middle, inverse, title-slide

# Modeling non-linear relationships

## Data Science in a Box

### datasciencebox.org
---

layout: true

<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle

# Model checking

---

## Data: Paris Paintings

```r
pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA"))
```

- Number of observations: 3393
- Number of variables: 61

---

## "Linear" models

- We're fitting a "linear" model, which assumes a linear relationship between our explanatory and response variables.
- But how do we assess this?

---

## Graphical diagnostic: residuals plot

.panelset[
.panel[.panel-name[Plot]
<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ht_wt_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Height_in ~ Width_in, data = pp)

*ht_wt_fit_aug <- augment(ht_wt_fit$fit)

ggplot(ht_wt_fit_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = "dashed") +
  labs(x = "Predicted height", y = "Residuals")
```
]
.panel[.panel-name[Augment]

```r
ht_wt_fit_aug
```

```
## # A tibble: 3,135 x 9
##    .rownames Height_in Width_in .fitted  .resid     .hat .sigma
##    <chr>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl>  <dbl>
##  1 1                37     29.5   26.7  10.3    0.000399   8.30
##  2 2                18     14     14.6   3.45   0.000396   8.31
##  3 3                13     16     16.1  -3.11   0.000361   8.31
##  4 4                14     18     17.7  -3.68   0.000337   8.31
##  5 5                14     18     17.7  -3.68   0.000337   8.31
##  6 6                 7     10     11.4  -4.43   0.000498   8.31
##  7 7                 6     13     13.8  -7.77   0.000418   8.30
##  8 8                 6     13     13.8  -7.77   0.000418   8.30
##  9 9                15     15     15.3  -0.333  0.000377   8.31
## 10 10                9      7      9.09 -0.0870 0.000601   8.31
## # … with 3,125 more rows, and 2 more variables: .cooksd <dbl>,
## #   .std.resid <dbl>
```
]
]

---

## More on `augment()`

```r
glimpse(ht_wt_fit_aug)
```

```
## Rows: 3,135
## Columns: 9
## $ .rownames  <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9",…
## $ Height_in  <dbl> 37, 18, 13, 14, 14, 7, 6, 6, 15, 9, 9, 16, 1…
## $ Width_in   <dbl> 29.5, 14.0, 16.0, 18.0, 18.0, 10.0, 13.0, 13…
## $ .fitted    <dbl> 26.65490, 14.55256, 16.11415, 17.67574, 17.6…
## $ .resid     <dbl> 10.3451004, 3.4474447, -3.1141481, -3.675740…
## $ .hat       <dbl> 0.0003991488, 0.0003961825, 0.0003611963, 0.…
## $ .sigma     <dbl> 8.303538, 8.305367, 8.305409, 8.305336, 8.30…
## $ .cooksd    <dbl> 3.099689e-04, 3.416655e-05, 2.541574e-05, 3.…
## $ .std.resid <dbl> 1.24600543, 0.41522347, -0.37507338, -0.4427…
```

---

## Looking for...

- Residuals distributed randomly around 0
- With no visible pattern along the x or y axes

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Fan shapes**
]

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Groups of patterns**
]

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Residuals correlated with predicted values**
]

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

## Not looking for...

.large[
**Any patterns!**
]

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />
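---

## Aside: how a patterned residuals plot arises

A minimal sketch with simulated data (not the Paris Paintings; the object names here are made up for illustration): fitting a straight line to a clearly curved relationship leaves an obvious pattern in the residuals plot.

```r
library(tidymodels)

# Simulate a curved (quadratic) relationship
set.seed(1234)
sim <- tibble(
  x = runif(500, 0, 10),
  y = x^2 + rnorm(500, sd = 5)
)

# Fit a straight line anyway
sim_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ x, data = sim)

# Residuals vs. predicted values show a clear U-shaped pattern
augment(sim_fit$fit) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "gray", lty = "dashed") +
  labs(x = "Predicted value", y = "Residuals")
```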
---

.question[
What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between height and width of paintings?
]

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" />

---

class: middle

# Exploring linearity

---

## Data: Paris Paintings

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-12-1.png" width="70%" style="display: block; margin: auto;" />

---

## Price vs. width

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-13-1.png" width="70%" style="display: block; margin: auto;" />

---

## Focus on paintings with `Width_in < 100`

That is, paintings with width < 2.54 m

```r
pp_wt_lt_100 <- pp %>%
  filter(Width_in < 100)
```

---

## Price vs. width

.question[
Which plot shows a more linear relationship?
]

.small[
.pull-left[
<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Price vs. width, residuals

.question[
Which plot shows residuals that are uncorrelated with predicted values from the model? Also, what is the unit of the residuals?
]

.pull-left[
<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Transforming the data

- We saw that `price` has a right-skewed distribution, and the relationship between price and width of the painting is non-linear.

--

- In these situations a transformation applied to the response variable may be useful.

--

- In order to decide which transformation to use, we should examine the distribution of the response variable.

--

- The extremely right-skewed distribution suggests that a log transformation may be useful.
    - log = natural log, `\(ln\)`
    - Default base of the `log` function in R is the natural log: <br>
    `log(x, base = exp(1))`

---

## Logged price vs. width

.question[
How do we interpret the slope of this model?
]

<img src="u4-d03-modeling-nonlinear-relationships_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" />

---

## Models with log transformation

```r
linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price) ~ Width_in, data = pp_wt_lt_100) %>%
  tidy()
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.67     0.0585      79.9  0.      
## 2 Width_in      0.0192   0.00226      8.48 3.36e-17
```

---

## Interpreting the slope

$$ \widehat{log(price)} = 4.67 + 0.0192 Width $$

--

- For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.0192 log livres.

--

- which is not a very useful statement...
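---

## Undoing the log, visually

A minimal sketch, using the fitted coefficients above: back-transforming `\(\widehat{log(price)} = 4.67 + 0.0192 Width\)` to the price scale turns the straight line in log(price) into a multiplicative curve in price, which is why the next step is to "undo" the log when interpreting the slope.

```r
library(tidyverse)

# Back-transform the fitted log-linear model:
# predicted price = exp(4.67 + 0.0192 * Width_in)
fitted_curve <- tibble(Width_in = 0:100) %>%
  mutate(price_hat = exp(4.67 + 0.0192 * Width_in))

ggplot(fitted_curve, aes(x = Width_in, y = price_hat)) +
  geom_line() +
  labs(x = "Width (inches)", y = "Predicted price (livres)")
```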
---

## Working with logs

- Subtraction and logs: `\(log(a) - log(b) = log(a / b)\)`

--

- Natural logarithm: `\(e^{log(x)} = x\)`

--

- We can use these identities to "undo" the log transformation

---

## Interpreting the slope

The slope coefficient for the log-transformed model is 0.0192, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.0192 log livres.

--

.question[
Using this information, and the properties of logs that we just reviewed, fill in the blanks in the following alternate interpretation of the slope:

>For each additional inch the painting is wider, the price of the painting is expected to be `___`, on average, by a factor of `___`.
]

---

>For each additional inch the painting is wider, the price of the painting is expected to be `___`, on average, by a factor of `___`.

$$ log(\text{price for width x+1}) - log(\text{price for width x}) = 0.0192 $$

--

$$ log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.0192 $$

--

$$ e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.0192} $$

--

$$ \frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02 $$

--

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.

---

## Recap

- Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.

--

- The most common such transformation is the log transform, `\(log(y)\)`, which is especially useful when the response variable is (extremely) right skewed.

--

- This transformation is also useful for variance stabilization.

--

- When using a log transformation on the response variable, the interpretation of the slope changes: *"For each unit increase in x, y is expected on average to be higher/lower <br> by a factor of `\(e^{b_1}\)`."*

--

- Another useful transformation is the square root, `\(\sqrt{y}\)`, which is especially useful when the response variable is counts.

---

## Transform, or learn more?

- Data transformations may also be useful when the relationship is non-linear
- However, in those cases a polynomial regression may be more appropriate
  + This is beyond the scope of this course, but you're welcome to try it for your final project, and I'd be happy to provide further guidance

---

## Aside: when `\(y = 0\)`

In some cases the value of the response variable might be 0, and the log of 0 is undefined:

```r
log(0)
```

```
## [1] -Inf
```

--

The trick is to add a very small number to the value of the response variable for these cases so that the `log` function can still be applied:

```r
log(0 + 0.00001)
```

```
## [1] -11.51293
```
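---

## Aside: when `\(y = 0\)`, in the model

A minimal sketch of where that offset would go when fitting the model. This assumes you wanted to guard against zero prices in `pp_wt_lt_100` from earlier; the column name `price_offset` is made up for illustration.

```r
library(tidymodels)

# Add a small offset so that any zero prices don't become -Inf under log()
pp_offset <- pp_wt_lt_100 %>%
  mutate(price_offset = price + 0.00001)

linear_reg() %>%
  set_engine("lm") %>%
  fit(log(price_offset) ~ Width_in, data = pp_offset) %>%
  tidy()
```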