Meet the toolkit: programming

---

## Data science

.pull-left-wide[
- Data science is an exciting discipline that allows you to turn data into understanding, insight, and knowledge.

- We're going to learn to do this in a `tidy` way -- more on that later!

- This is a course on programming for analysis and visualization, with an emphasis on statistical thinking.
]

---

## Course FAQ

.pull-left-wide[
**Q - What data science/programming background does this course assume?**  
A - None.

**Q - Is this an intro stat course?**  
A - While statistics `$\ne$` data science, they are very closely related and have a tremendous amount of overlap. Hence, this course is a great way to get started with statistics. However, this course is *not* your typical high school statistics course, and there is a greater emphasis on data visualization.

**Q - Will we be doing computing?**   
A - Yes.
]

---

## Course Schedule

- .pink[**Tuesday**]: Topic Intro
- .pink[**Thursday**]: Topic Intro Lab
- .pink[**Help Sessions**]: Hathorn 106 and PLC

---
# One link to rule them all...

... where you can find everything except your course grades!

[https://bates-dataviz.netlify.app/](https://bates-dataviz.netlify.app/)

Bookmark the link!

???
This link is going to have everything but your course grades. It will link you to the relevant places, including Lyceum.

The course material will be released on a weekly basis so that you can take a look at the week ahead. There are links to videos, readings, and other resources. I try to keep these links up to date, but I may occasionally miss something, and I'd love to hear from you if you have trouble.

---

# Software

---

???

Before my undergraduate degree, this is what data analysis mostly meant to me. Data in rows and columns like this may be familiar to some of you if you have worked with Excel or other spreadsheets. When we collect data, many of us like to put it into something like this.

So data that comes in a tabular format like this might be familiar to you.

---

???
When I did my undergraduate thesis and started learning R, this is what R looked like. I remember thinking of the interface as a black box, and I wasn't really sure where the data was contained.

---

???

In this course we are going to be using something a little bit different. We are going to be using Posit to interact with the computing language. If you remember the images from the previous two slides you can see that Posit combines those two components so that you can view the data and also execute code in the console. Posit is the Integrated Development Environment (IDE) that we are going to use in this course.

---

# Data science life cycle

---

???
Let's also talk about the Data Science Life Cycle. This is the diagram from the book R for Data Science that we'll be referring to throughout the course. Note that this isn't the only diagram out there representing the data science life cycle but it is the one that we are using to structure this course.

So how does the data science life cycle begin?
---

???
Usually you have some data, maybe in a spreadsheet or a database, and you need to import it into R.

---

???

Then we need to spend some time organising that data to make it easier to use and analyse. This often includes double-checking the data for mistakes and tidying it, and it may also include transforming it into a table that is easier to use or analyse.

---

???

Once the data is in a format that is easy to work with, you want to visualize your data to start to gain some insights from it. 
---

???
Then, perhaps you will go on to modelling your data.

---

???
And the reality is it never ends there. You will gain more insight into the data and you may need to go back and check and adjust your assumptions.

The last step is communicating your results and findings.

---

## Course toolkit

.pull-left[
### .gray[Course operation]
.gray[
- [bates-dataviz.netlify.app](https://bates-dataviz.netlify.app/)
- [Google Classroom](https://classroom.google.com/)
]
]
.pull-right[
### .pink[Doing data science]
- .pink[Programming:]
  - .pink[R]
  - .pink[Posit]
  - .pink[tidyverse]
  - .pink[R Markdown]
- .gray[Version control and collaboration:]
  - .gray[Git]
  - .gray[GitHub]
]

---

## Learning goals

By the end of the course, you will be able to...

- gain insight from data

--
- gain insight from data, **reproducibly**

--
- gain insight from data, reproducibly, **using modern programming tools and techniques**

--
- gain insight from data, reproducibly **and collaboratively**, using modern programming tools and techniques

--
- gain insight from data, reproducibly **(with literate programming and version control)** and collaboratively, using modern programming tools and techniques

---

# Reproducible data analysis

---

## Reproducibility checklist

Near-term goals:

- Are the tables and figures reproducible from the code and data?
- Does the code actually do what you think it does?
- In addition to what was done, is it clear *why* it was done?

Long-term goals:

- Can the code be used for other data?
- Can you extend the code to do other things?

---

## Toolkit for reproducibility

- Scriptability `$\rightarrow$` .pink[R]
- Literate programming (code, narrative, output in one place) `$\rightarrow$` .pink[R Markdown]
- Version control `$\rightarrow$` .pink[Git / GitHub]

---

# R and Posit

---

## R and Posit

.pull-left[
<img src="img/r-logo.png" width="25%" style="display: block; margin: auto;" />
- R is an open-source statistical **programming language**
- R is also an environment for statistical computing and graphics
- It's easily extensible with *packages*
]
.pull-right[
<img src="img/rstudio-logo.png" width="50%" style="display: block; margin: auto;" />
- Posit is a convenient interface for R called an **IDE** (integrated development environment), e.g. *"I write R code in the Posit IDE"*
- Posit is not a requirement for programming with R, but it's very commonly used by R programmers and data scientists
]

---

## R packages

- **Packages** are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data

- As of September 2020, there were over 16,000 R packages available on **CRAN** (the Comprehensive R Archive Network)<sup>1</sup>

- We're going to work with a small (but important) subset of these!

.footnote[
<sup>1</sup> [CRAN contributed packages](https://cran.r-project.org/web/packages/).
]

---

## Tour: R and Posit

---

## A short list (for now) of R essentials

- Functions are (most often) verbs, followed by what they will be applied to in parentheses:

``` r
do_this(to_this)
do_that(to_this, to_that, with_those)
```

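- Packages are installed once with the `install.packages` function and loaded with the `library` function once per session:

``` r
# Install once (downloads the package from CRAN)
install.packages("package_name")
# Load in every new R session before using the package
library(package_name)
```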
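
A small runnable sketch of both ideas, using only functions that ship with R. The `tools` package is chosen here purely for illustration; because it comes with every R installation, no `install.packages()` call is needed for it:

```r
# Functions are verbs applied to their arguments
mean(c(2, 4, 6))              # average of three numbers: 4
round(3.14159, digits = 2)    # round to two decimal places: 3.14

# Load a package once per session, then call its functions by name
library(tools)
file_ext("bechdel.Rmd")       # file extension: "Rmd"
```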

---

## R essentials (continued)

- Columns (variables) in data frames are accessed with `$`:

``` r
dataframe$var_name
```

- Object documentation can be accessed with `?`

``` r
?mean
```
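
To try both on real data, here is a short sketch using `mtcars`, a data frame that ships with R. It is not part of the course materials and is used only as an illustration:

```r
# mtcars is a built-in data frame; mpg is one of its columns
head(mtcars$mpg, 3)     # first three values: 21.0 21.0 22.8
length(mtcars$mpg)      # one value per row: 32
mean(mtcars$mpg)        # a column is a vector, so functions apply directly

# Documentation works for data sets too; in an interactive session, run:
# ?mtcars
```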

---

## tidyverse

- The **tidyverse** is an opinionated collection of R packages designed for data science
- All packages share an underlying philosophy and a common grammar

---

## rmarkdown

- **rmarkdown** and the various packages that support it enable R users to write their code and prose in reproducible computational documents
- We will generally refer to R Markdown documents (with `.Rmd` extension), e.g. *"Do this in your R Markdown document"* and rarely discuss loading the rmarkdown package

---

# R Markdown

---

## R Markdown

- Fully reproducible reports -- each time you knit, the analysis is run from the beginning
- Simple markdown syntax for text
- Code goes in chunks, delimited by three backticks; narrative goes outside of chunks
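
To see that layout without leaving R, the sketch below writes a tiny R Markdown file to a temporary folder. The file name and its contents are made up for illustration, and the final step assumes the `rmarkdown` package and pandoc are installed, so it is left commented out:

```r
# Three backticks open and close a code chunk
fence <- strrep("`", 3)

# A minimal R Markdown document, one element per line
rmd_lines <- c(
  "---",
  "title: \"Minimal example\"",
  "output: html_document",
  "---",
  "",
  "Narrative text goes outside of chunks.",
  "",
  paste0(fence, "{r}"),   # chunk opens
  "x <- 2",
  "x * 3",
  fence                   # chunk closes
)

# Write the document to a temporary file (hypothetical file name)
rmd_path <- file.path(tempdir(), "minimal-example.Rmd")
writeLines(rmd_lines, rmd_path)

# To knit it, assuming the rmarkdown package and pandoc are installed:
# rmarkdown::render(rmd_path)
```

In practice you will write `.Rmd` files directly in the editor; building one from a character vector just makes the three parts (header, narrative, chunk) easy to see.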

---

## Tour: R Markdown

---

## Environments

The environment of your R Markdown document is separate from the Console.

Remember this, and expect it to bite you a few times as you're learning to work 
with R Markdown!

---

## Environments

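First, run the following in the Console:

``` r
x <- 2
x * 3
```

Then, add the following to a chunk in your R Markdown document and knit:

``` r
x * 3
```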
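
Knitting does not see objects created in the Console. The effect can be imitated inside a single R session. This is only an analogy: the two isolated environments below (hypothetical names `console_env` and `document_env`) stand in for the Console and the document.

```r
# Two isolated environments standing in for the Console and the document.
# parent = baseenv() keeps them from seeing each other's (or global) objects.
console_env  <- new.env(parent = baseenv())
document_env <- new.env(parent = baseenv())

# "In the Console": define x, then use it
assign("x", 2, envir = console_env)
console_result <- eval(quote(x * 3), envir = console_env)   # 6

# "In the document": x was never defined here, so the same code fails
document_result <- tryCatch(
  eval(quote(x * 3), envir = document_env),
  error = function(e) conditionMessage(e)
)
document_result   # an error message saying that x was not found
```

The fix in a real document is the same as here: define `x` inside the document itself, in a chunk that runs before the one that uses it.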

---

## R Markdown help

.pull-left[
.center[
.midi[R Markdown Cheat Sheet 
`Help -> Cheatsheets`]
]
<img src="img/rmd-cheatsheet.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
.center[
.midi[Markdown Quick Reference 
`Help -> Markdown Quick Reference`]
]
<img src="img/md-cheatsheet.png" width="80%" style="display: block; margin: auto;" />
]

---

## How will we use R Markdown?

- Every assignment / report / project / etc. is an R Markdown document
- You'll always have a template R Markdown document to start with
- The amount of scaffolding in the template will decrease over the semester

---

## What's with all the hexes?

.footnote[
Mitchell O'Hara-Wild, [useR! 2018 feature wall](https://www.mitchelloharawild.com/blog/user-2018-feature-wall/)
]

---

.your-turn[
.light-blue[.hand[Your turn:]] `AE 02 - Bechdel + R Markdown`
- [The Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test) asks whether a work of fiction features at least two named women characters who talk to each other about something other than a man.
- Go to [Posit Cloud](https://posit.cloud/) and start the assignment `AE 02 - Bechdel + R Markdown`.
- Open and knit the R Markdown document `bechdel.Rmd`, review the document, and fill in the blanks.
]

---
## Acknowledgements

* This course builds on the materials from [Data Science in a Box](https://datasciencebox.org/), developed by Mine Çetinkaya-Rundel, which are adapted under the [Creative Commons Attribution Share Alike 4.0 International](https://github.com/rstudio-education/datascience-box/blob/master/LICENSE.md) license.