02_join-skim-eda.Rmd

---
title: "EDA (Skim), Join, Visualize"
author: "John Little"
date: "`r Sys.Date()`"
abstract: "This document is a tidyverse/dplyr tour using R/RStudio.  The instructor will demonstrate how to undertake an initial investigation into two datasets.  More information about our Intro to R workshop can be found at https://rfun.library.duke.edu/intro2r/.  This document is covered by the CC BY-NC license:  https://creativecommons.org/licenses/by-nc/4.0/legalcode.\n"
output:  
  html_notebook:
    toc: true
---

# Load Packages

```{r}
library(tidyverse)
library(skimr)
```

# Load Data

- Star Wars character data are an on-board part of dplyr:  `dplyr::starwars`

- Star Wars survey data are from fivethirtyeight.com.  

    - https://github.com/fivethirtyeight/data/tree/master/star-wars-survey
    - https://data.fivethirtyeight.com/
    - Unless otherwise noted, five thirty eight data sets are available under the Creative Commons Attribution 4.0 International License, and the code is available under the MIT License.
    - https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/

Load the survey data that deals with character's favorability rating.  

```{r}
favorability_popularity_rating <- 
  read_csv("data/538_favorability_popularity.csv",
           skip = 11)
```

# Raw Data

## Star Wars Charactesr

```{r show_dplyr-starwars-tibble}
starwars
```

## Favorability Data

I have transformed the *Five Thirty Eight* [data](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey).  It's ready to merge with the other dataset.

```{r show_favorability-rating}
favorability_popularity_rating
```


#  EDA

Exploratory Data Analysis gives me a general sense of the data.  For example, what are the means, medians, distributions of each data variable.

## Skim

Skim the `dplyr::starwars` raw data.

```{r}
skim(starwars)
```


Skim the transformed favorability data gathered from fivethirtyeight.com.

```{r skim-joined}
skim(favorability_popularity_rating)
```

# Join Data

There are different types of joins (e.g. inner, left, right, full, etc.).  See the `dplyr::join` [documentation](http://dplyr.tidyverse.org/reference/join.html) for a more complete explanation. Note that the join key plays a critical role in the success rate of the join.  Joining on alphanumeric keys can be problematic due to complications such as diacritical marks, case sensitivity, and spaces. Ideally, match on consistent numeric keys.


```{r join-data}
sw_joined <- starwars %>% 
  left_join(favorability_popularity_rating, by = "name") %>% 
  select(1, 14, 2:14) 

sw_joined %>% 
  arrange(desc(fav_rating))
```

### Skim Joined Data

```{r}
sw_joined %>% 
  drop_na(fav_rating) %>% 
  skim()
```


### anti_join

Joins that use a natural language key typically yield imprecise matching. Additional join functions can help you identify what didn't match.  Refer to the multiple join functions:  http://dplyr.tidyverse.org/reference/join.html

In this case we can use the `anti_join` function to determine what didn't match.  The natural language variable (`name` -- a character vector) has multiple spellings and variations.  Thus the following didn't match.  Can you identify what caused the mismatches with the following names?

```{r}
favorability_popularity_rating %>% anti_join(starwars, by = "name")
```

### inner_join

Alternatively to the `left_join` (which doesn't loose data), we could use the `inner_join` to retain only data when there is a match.  We could also use the results from the anti_join to edit the values in the matching key (`name`) and thereby force more key matches.  
 
```{r}
starwars %>% inner_join(favorability_popularity_rating, by = "name")
```

# Visualize

Basic ggplot2 template 

```r
ggplot(data = starwars, aes(x = height, y = fav_rating)) +
  geom_----()

```


## Scatter Plot

Is a character's favorability dependent on their individual height?

```{r}
sw_joined %>% 
  drop_na(fav_rating) %>% 
  ggplot(aes(x = height, y = fav_rating)) +
  geom_point()
```


## Histogram

What is the distribution of the birth years of all the characters?

```{r}
starwars %>% 
  ggplot(aes(x = birth_year)) +
  geom_histogram()
```


What is the distribution of the character's height?

```{r}
starwars %>% 
  ggplot(aes(x = height)) +
  geom_histogram()
```

## Box Plot

What is the distribution of the mass of three species:  Human, Droid, Gungan?

```{r}
starwars %>% 
  drop_na(species, mass) %>% 
  filter(species == "Human" |
           species == "Droid" |
           species == "Gungan") %>% 
  ggplot(aes(x = species, y = mass)) +
  geom_boxplot()
```


## Bar Plot

```{r}
ggplot(data = starwars) +
  geom_bar(aes(fct_infreq(eye_color)))
```

There are at least two different barplot geom_s.

- [`geom_bar`](https://ggplot2.tidyverse.org/reference/geom_bar.html) - counts the cases (rows) in the dataframe variable and uses that count to display the bar heights.

- [`geom_col`](https://ggplot2.tidyverse.org/reference/geom_bar.html) - use `geom_col` when bar heights are already represented in the dataframe.  For example, summary counts shows represented values that should be used for the bar height.

Sticking with geom_bar, below bar graph that uses more of ggplots features to generate a more sophisticated graph.


```{r}
ggplot(starwars) +
  geom_bar(aes(fct_rev(fct_lump(fct_infreq(eye_color), n = 6))), 
           fill = "grey") + 
  coord_flip() +
  geom_bar(data = starwars %>% filter(eye_color == "orange"),
           aes(eye_color), fill = "orange") +
  labs(x = "Eye Color", y = "",
       title = "Eye Color of Starwars Characters",
       subtitle = "Oranges eyes are uncommon.",
       caption = "Data Source: dplyr::starwars")

```

## Other Aesthetics

Aside from identifying the x and y data vectors in the `aes()` function, there are several other aesthetics that you can manipulate:  color, fill, lines, shape, text.  See more at the [Aesthetic specifications documentation](https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#polygons). 

For example, below we map color and size to variables within the data frame.  Note that the legends for the `species` and `birth_year` variables are automatically generated by ggplot when variables are assigned within the aesthetic argument.  

Next, use a fixed value outside the `aes()` function -- but inside the `geom_`.  In this case, the `alpha` argument affects the opacity of all the plotted points -- one way to handle the problem of overplotting.  

```{r}
starwars %>% 
  drop_na(mass, height, birth_year) %>% 
  filter(mass < 400,
         species == "Human" | 
           species == "Droid" | 
           species == "Gungan" |
           species == "Mirialan" |
           species == "Ewok" |
           species == "Wookiee") %>% 
  ggplot(aes(x = mass, y = height)) +
    geom_point(aes(color = species, size = birth_year), alpha = 0.7)
```
 

# Session Info

For reproducibility, it's always good to document the session info.

```{r session_info}
sessioninfo::session_info()
```

# License

Creative Commons.  CC By-NC  https://creativecommons.org/licenses/by-nc/4.0/legalcode