Doing data science refresher

.title[
# Doing data science refresher
]
.subtitle[
## <br><br> College of the Atlantic
]
.author[
### <a href="https://coa-data-analysis.netlify.app/">https://coa-data-analysis.netlify.app/</a>
]

---

# Welcome to Data Science 2: Programming for Data Analysis

---

# Introductions

<div class="figure" style="text-align: left">
<img src="img/avatar.jpg" alt="Image Credit: R monster mascots by @Allison Horst" width="50%" />
<p class="caption">Image Credit: R monster mascots by @Allison Horst</p>
</div>

* Name (and pronouns if you want to share)
* Option 1: Something you recently taught yourself how to do
* Option 2: What item would you be most excited to find in the free box?

---

# Agenda

## Part 1: Get to know one another

* Walk around the room and find three people
* Re-introduce yourselves to your group
* Find something you all have in common
* Continue on until you have formed three groups.

## Part 2: What is this course going to look like?

## Part 3: Open Questions

---

## Data science

.pull-left-wide[
- Data science is an exciting discipline that allows you to turn data into understanding, insight, and knowledge.

- We're going to learn to do this in a `tidy` way -- more on that later!

- This is a course on programming for analysis and visualization, with an emphasis on statistical thinking.
]

---

# Data science life cycle

---

???
Let's also talk about the Data Science Life Cycle. This is the diagram from the book R for Data Science that we'll be referring to throughout the course. Note that this isn't the only diagram out there representing the data science life cycle but it is the one that we are using to structure this course.

So how does the data science life cycle begin?
---

???
Usually you have some data, perhaps in a spreadsheet or a database, that you need to import into R.

---

???

Then we need to spend some time organising that data to make it easier to use and analyse. This often includes double-checking the data for mistakes and tidying it, and it may also include transforming it into the table you want, one that is easier to use or analyse.

---

???

Once the data is in a format that is easy to work with, you want to visualize your data to start to gain some insights from it. 
---

???
Then, perhaps you will go on to modelling your data.

---

???
And the reality is it never ends there. You will gain more insight into the data and you may need to go back and check and adjust your assumptions.

The last step is communicating your results and findings.

---
class: middle

# What's in a data analysis?

---

## Five core activities of data analysis

1. Stating and refining the question
1. Exploring the data
1. Building formal statistical models
1. Interpreting the results
1. Communicating the results

.footnote[
Roger D. Peng and Elizabeth Matsui. *The Art of Data Science: A Guide for Anyone Who Works with Data*. Skybrude Consulting, LLC (2015).
]

---

# Stating and refining the question

---

## Six types of questions

1. **Descriptive:** summarize a characteristic of a set of data
1. **Exploratory:** analyze to see if there are patterns, trends, or relationships between variables (hypothesis generating)
1. **Inferential:** analyze patterns, trends, or relationships in representative data from a population
1. **Predictive:** make predictions for individuals or groups of individuals
1. **Causal:** whether changing one factor will change another factor, on average, in a population
1. **Mechanistic:** explore "how" as opposed to whether

.footnote[
Jeffrey T. Leek and Roger D. Peng. "What is the question?" Science 347.6228 (2015): 1314-1315.
]

---

## Ex: COVID-19 and Vitamin D

1. **Descriptive:** frequency of hospitalisations due to COVID-19 in a set of data collected from a group of individuals
--

1. **Exploratory:** examine relationships between a range of dietary factors and COVID-19 hospitalisations
--

1. **Inferential:** examine whether any relationship between taking Vitamin D supplements and COVID-19 hospitalisations found in the sample holds for the population at large

--
1. **Predictive:** what types of people will take Vitamin D supplements during the next year

--
1. **Causal:** whether people with COVID-19 who were randomly assigned to take Vitamin D supplements are hospitalised at a different rate than those who were not

--
1. **Mechanistic:** how increased vitamin D intake leads to a reduction in the number of viral illnesses

---

## Questions to data science problems

- Do you have appropriate data to answer your question?
- Do you have information on confounding variables?
- Was the data you're working with collected in a way that introduces bias?

.question[
Suppose I want to estimate the average number of children in households in Edinburgh. I conduct a survey at an elementary school in Edinburgh and ask students at this elementary school how many children, including themselves, live in their house. Then, I take the average of the responses. Is this a biased or an unbiased estimate of the number of children in households in Edinburgh? If biased, will the value be an overestimate or underestimate?
]
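
One way to reason about the question above is a small simulation. The sketch below assumes a hypothetical population in which the number of children per household follows a Poisson distribution with mean 1 (an arbitrary choice for illustration, not Edinburgh data), and it ignores the fact that only school-age children would actually be surveyed. It compares the household-level average with the average of the answers the children themselves would give.

```r
set.seed(1)

# Hypothetical population: 10,000 households, children per household ~ Poisson(1)
households <- rpois(10000, lambda = 1)

# Target quantity: average number of children per household
true_mean <- mean(households)

# Surveying children: a household with k children contributes k answers,
# and households with no children contribute none
child_reports <- rep(households, times = households)
survey_mean <- mean(child_reports)

c(true_mean = true_mean, survey_mean = survey_mean)
```

Because larger households are represented once per child and childless households never appear, the two averages differ; comparing them suggests the direction of the bias.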

---

# Exploratory data analysis

---

## Checklist

- Formulate your question
- Read in your data
- Check the dimensions
- Look at the top and the bottom of your data
- Validate with at least one external data source
- Make a plot
- Try the easy solution first

---

## Formulate your question

- Consider scope:
  - Are air pollution levels higher on the east coast than on the west coast?
  - Are hourly ozone levels on average higher in New York City than they are in Los Angeles?
  - Do counties in the eastern United States have higher ozone levels than counties in the western United States?
- Most importantly: "Do I have the right data to answer this question?"

---

## Read in your data

- Place your data in a folder called `data`
- Read it into R with `read_csv()` or friends (`read_delim()`, `read_excel()`, etc.)

```r
library(readxl)
fav_food <- read_excel("data/favourite-food.xlsx")
fav_food
```

```
## # A tibble: 5 × 6
##   `Student ID` `Full Name`    favourite.food mealPlan AGE   SES  
##          <dbl> <chr>          <chr>          <chr>    <chr> <chr>
## 1            1 Sunil Huffmann Strawberry yo… Lunch o… 4     High 
## 2            2 Barclay Lynn   French fries   Lunch o… 5     Midd…
## 3            3 Jayendra Lyne  N/A            Breakfa… 7     Low  
## 4            4 Leon Rossini   Anchovies      Lunch o… 99999 Midd…
## 5            5 Chidiegwu Dun… Pizza          Breakfa… five  High
```
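
The printed table already hints at data-entry problems, such as `N/A` typed in as text. Below is a minimal sketch of one way to handle this at import time, using the `na` argument of `readr::read_csv()`. The file is a hypothetical two-row CSV written to a temporary location so that the example is self-contained; it is not the course spreadsheet.

```r
library(readr)

# Hypothetical toy file standing in for a real data file in `data/`
path <- tempfile(fileext = ".csv")
write_lines(c("student_id,favourite_food", "1,Pizza", "2,N/A"), path)

# Treat empty strings, "NA", and "N/A" as missing values while reading
foods <- read_csv(path, na = c("", "NA", "N/A"), show_col_types = FALSE)
foods
```

`read_excel()` accepts a similar `na` argument, so the same idea should carry over to spreadsheets.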

---

## `clean_names()`

If the variable names are inconsistently formatted, use `janitor::clean_names()`.

```r
library(janitor)
fav_food %>% clean_names()  
```

```
## # A tibble: 5 × 6
##   student_id full_name       favourite_food meal_plan age   ses  
##        <dbl> <chr>           <chr>          <chr>     <chr> <chr>
## 1          1 Sunil Huffmann  Strawberry yo… Lunch on… 4     High 
## 2          2 Barclay Lynn    French fries   Lunch on… 5     Midd…
## 3          3 Jayendra Lyne   N/A            Breakfas… 7     Low  
## 4          4 Leon Rossini    Anchovies      Lunch on… 99999 Midd…
## 5          5 Chidiegwu Dunk… Pizza          Breakfas… five  High
```
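
Even with clean names, `age` is still stored as text because of entries like `99999` and `five`. The following is a rough sketch of one possible fix, run on a hypothetical stand-in for that column. It assumes that `99999` is a code for a missing value and that `five` was meant as the number 5; both are guesses that should be checked against the data source.

```r
library(dplyr)

# Hypothetical stand-in for the `age` column printed above
ages <- tibble(age = c("4", "5", "7", "99999", "five"))

ages_clean <- ages %>%
  mutate(
    age = na_if(age, "99999"),              # assumption: 99999 codes a missing value
    age = if_else(age == "five", "5", age), # assumption: "five" was meant as 5
    age = as.numeric(age)                   # now safe to convert to a number
  )

ages_clean
```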

---

## Case study: NYC Squirrels!

- [The Squirrel Census](https://www.thesquirrelcensus.com/) is a multimedia science, design, and storytelling project focusing on the Eastern gray squirrel (*Sciurus carolinensis*). They count squirrels and present their findings to the public.
- This table contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and with humans.

```r
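# remotes::install_github("mine-cetinkaya-rundel/nycsquirrels18")
library(nycsquirrels18) # provides the `squirrels` data
library(tidyverse)      # pipe, dplyr verbs, and ggplot2 used on the following slides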
```

---

## Locate the codebook

[Codebook for the `squirrels` data](https://mine-cetinkaya-rundel.github.io/nycsquirrels18/reference/squirrels.html)

## Check the dimensions

```r
dim(squirrels)
```

```
## [1] 3023   35
```

---

## Look at the top...

```r
squirrels %>% head()
```

```
## # A tibble: 6 × 35
##    long   lat unique_squirrel_id hectare shift date      
##   <dbl> <dbl> <chr>              <chr>   <chr> <date>    
## 1 -74.0  40.8 13A-PM-1014-04     13A     PM    2018-10-14
## 2 -74.0  40.8 15F-PM-1010-06     15F     PM    2018-10-10
## 3 -74.0  40.8 19C-PM-1018-02     19C     PM    2018-10-18
## 4 -74.0  40.8 21B-AM-1019-04     21B     AM    2018-10-19
## 5 -74.0  40.8 23A-AM-1018-02     23A     AM    2018-10-18
## 6 -74.0  40.8 38H-PM-1012-01     38H     PM    2018-10-12
## # ℹ 29 more variables: hectare_squirrel_number <dbl>, age <chr>,
## #   primary_fur_color <chr>, highlight_fur_color <chr>,
## #   combination_of_primary_and_highlight_color <chr>,
## #   color_notes <chr>, location <chr>,
## #   above_ground_sighter_measurement <chr>,
## #   specific_location <chr>, running <lgl>, chasing <lgl>,
## #   climbing <lgl>, eating <lgl>, foraging <lgl>, …
```

---

## ...and the bottom

```r
squirrels %>% tail()
```

```
## # A tibble: 6 × 35
##    long   lat unique_squirrel_id hectare shift date      
##   <dbl> <dbl> <chr>              <chr>   <chr> <date>    
## 1 -74.0  40.8 6D-PM-1020-01      06D     PM    2018-10-20
## 2 -74.0  40.8 21H-PM-1018-01     21H     PM    2018-10-18
## 3 -74.0  40.8 31D-PM-1006-02     31D     PM    2018-10-06
## 4 -74.0  40.8 37B-AM-1018-04     37B     AM    2018-10-18
## 5 -74.0  40.8 21C-PM-1006-01     21C     PM    2018-10-06
## 6 -74.0  40.8 7G-PM-1018-04      07G     PM    2018-10-18
## # ℹ 29 more variables: hectare_squirrel_number <dbl>, age <chr>,
## #   primary_fur_color <chr>, highlight_fur_color <chr>,
## #   combination_of_primary_and_highlight_color <chr>,
## #   color_notes <chr>, location <chr>,
## #   above_ground_sighter_measurement <chr>,
## #   specific_location <chr>, running <lgl>, chasing <lgl>,
## #   climbing <lgl>, eating <lgl>, foraging <lgl>, …
```

---

## Validate with at least one external data source

.pull-left[

```
## # A tibble: 3,023 × 2
##     long   lat
##    <dbl> <dbl>
##  1 -74.0  40.8
##  2 -74.0  40.8
##  3 -74.0  40.8
##  4 -74.0  40.8
##  5 -74.0  40.8
##  6 -74.0  40.8
##  7 -74.0  40.8
##  8 -74.0  40.8
##  9 -74.0  40.8
## 10 -74.0  40.8
## 11 -74.0  40.8
## 12 -74.0  40.8
## 13 -74.0  40.8
## 14 -74.0  40.8
## 15 -74.0  40.8
## # ℹ 3,008 more rows
```
]
.pull-right[
<img src="img/central-park-coords.png" width="100%" style="display: block; margin: auto;" />
]
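
The visual comparison with the map can also be done in code. This sketch checks whether coordinates fall inside a rough bounding box for Central Park. The box limits are approximate values read off a map (an assumption, not an official boundary), and the three sightings are hypothetical stand-ins for `squirrels %>% select(long, lat)`.

```r
library(dplyr)

# Approximate bounding box for Central Park (assumed values, not an official boundary)
park_bbox <- list(long = c(-73.982, -73.949), lat = c(40.764, 40.801))

# Hypothetical coordinates; the third point is deliberately outside the park
sightings <- tibble(
  long = c(-73.957, -73.970, -74.100),
  lat  = c(40.794, 40.777, 40.700)
)

# Flag each sighting as inside or outside the box
checked <- sightings %>%
  mutate(
    in_park = between(long, park_bbox$long[1], park_bbox$long[2]) &
      between(lat, park_bbox$lat[1], park_bbox$lat[2])
  )

checked %>% count(in_park)
```

Any rows with `in_park == FALSE` in the real data would be worth a closer look before trusting the coordinates.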

---

## Make a plot

```r
ggplot(squirrels, aes(x = long, y = lat)) +
  geom_point(alpha = 0.2)
```

.pull-left-wide[
**Hypothesis:** There will be a higher density of sightings on the perimeter than inside the park.
]

---

## Try the easy solution first

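.panelset[
.panel[.panel-name[Plot]
<img src="u2-d17-doing-data-science_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
# Split the hectare ID into its north-south number and east-west letter,
# then flag hectares on the outer edge of the grid as "perimeter"
squirrels <- squirrels %>%
  separate(hectare, into = c("NS", "EW"), sep = 2, remove = FALSE) %>%
  mutate(where = if_else(NS %in% c("01", "42") | EW %in% c("A", "I"), "perimeter", "inside"))

# Colour each sighting by whether it falls on the perimeter
ggplot(squirrels, aes(x = long, y = lat, color = where)) +
  geom_point(alpha = 0.2)
```
]

]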
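
To see what the labelling step does, here is a small sketch that applies the same `separate()` and `if_else()` logic to a handful of hypothetical hectare IDs (made up for illustration, in the same format as the real ones) and then counts sightings per group.

```r
library(dplyr)
library(tidyr)

# Hypothetical hectare IDs: two digits for north-south, one letter for east-west
toy <- tibble(hectare = c("01A", "13E", "42I", "20A", "21C"))

labelled <- toy %>%
  # sep = 2 splits after the second character: "13E" becomes "13" and "E"
  separate(hectare, into = c("NS", "EW"), sep = 2, remove = FALSE) %>%
  # First/last row or first/last column of the grid counts as perimeter
  mutate(where = if_else(NS %in% c("01", "42") | EW %in% c("A", "I"), "perimeter", "inside"))

labelled %>% count(where)
```

With the real data, raw counts alone may mislead, since the number of perimeter and inside hectares differs; dividing by the number of hectares in each group is one possible next step.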

---

## Then go deeper...

.panelset[
.panel[.panel-name[Plot]
<img src="u2-d17-doing-data-science_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
# Number of sightings in each hectare
hectare_counts <- squirrels %>%
  group_by(hectare) %>%
  summarise(n = n())

# Approximate centre of each hectare: the mean position of its sightings
hectare_centroids <- squirrels %>%
  group_by(hectare) %>%
  summarise(
    centroid_x = mean(long),
    centroid_y = mean(lat)
  )

# Attach counts and centroids to every sighting, then draw hexagonal bins
squirrels %>%
  left_join(hectare_counts, by = "hectare") %>%
  left_join(hectare_centroids, by = "hectare") %>%
  ggplot(aes(x = centroid_x, y = centroid_y, color = n)) +
  geom_hex()
```
]

]
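
As an alternative sketch, the count and the centroid can be computed in a single `summarise()` call, which gives one row per hectare without any joins. The sightings below are hypothetical values chosen only to make the arithmetic easy to follow.

```r
library(dplyr)

# Hypothetical sightings: two in hectare 01A, one in hectare 02B
toy_sightings <- tibble(
  hectare = c("01A", "01A", "02B"),
  long    = c(-73.98, -73.96, -73.97),
  lat     = c(40.77, 40.79, 40.78)
)

# One row per hectare: number of sightings and mean position
hectare_summary <- toy_sightings %>%
  group_by(hectare) %>%
  summarise(
    n = n(),
    centroid_x = mean(long),
    centroid_y = mean(lat),
    .groups = "drop"
  )

hectare_summary
```

Whether one row per hectare or one row per sighting is preferable depends on the plot: the joined version keeps every sighting, while this version keeps only the hectare-level summary.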

---

## The squirrel is staring at me!

```r
squirrels %>%
  filter(str_detect(other_interactions, "star")) %>%
  select(shift, age, other_interactions)
```

```
## # A tibble: 11 × 3
##   shift age   other_interactions                                 
##   <chr> <chr> <chr>                                              
## 1 AM    Adult staring at us                                      
## 2 PM    Adult he took 2 steps then turned and stared at me       
## 3 PM    Adult stared                                             
## 4 PM    Adult stared                                             
## 5 PM    Adult stared                                             
## 6 PM    Adult stared & then went back up tree—then ran to differ…
## # ℹ 5 more rows
```
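
The pattern `"star"` is a substring match, so it is worth checking what it does and does not catch. This sketch runs `str_detect()` on a few hypothetical notes (not rows from the data) and compares it with a slightly stricter regular expression, which is only one of several reasonable choices.

```r
library(stringr)

# Hypothetical free-text notes, including a missing value and a false friend
notes <- c("staring at us", "stared", "ran away", NA, "started climbing")

# Substring match: also flags "started"; NA stays NA (and filter() would drop it)
loose <- str_detect(notes, "star")

# Stricter match on whole words: stare, stared, or staring
strict <- str_detect(notes, "\\bstar(e|ed|ing)\\b")

data.frame(notes, loose, strict)
```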

---

## Communicating for your audience

- Avoid: Jargon, uninterpreted results, lengthy output
- Pay attention to: Organization, presentation, flow
- Don't forget about: Code style, coding best practices, meaningful commits
- Be open to: Suggestions, feedback, taking (calculated) risks

---
## Acknowledgements

* This course builds on materials from [Data Science in a Box](https://datasciencebox.org/), developed by Mine Çetinkaya-Rundel, which are adapted under the [Creative Commons Attribution Share Alike 4.0 International](https://github.com/rstudio-education/datascience-box/blob/master/LICENSE.md) license.