/
02_join-skim-eda.Rmd
236 lines (157 loc) · 6.99 KB
/
02_join-skim-eda.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
title: "EDA (Skim), Join, Visualize"
author: "John Little"
date: "`r Sys.Date()`"
abstract: "This document is a tidyverse/dplyr tour using R/RStudio. The instructor will demonstrate how to undertake an initial investigation into two datasets. More information about our Intro to R workshop can be found at https://rfun.library.duke.edu/intro2r/. This document is covered by the CC BY-NC license: https://creativecommons.org/licenses/by-nc/4.0/legalcode.\n"
output:
html_notebook:
toc: true
---
# Load Packages
```{r}
library(tidyverse)
library(skimr)
```
# Load Data
- Star Wars character data are an on-board part of dplyr: `dplyr::starwars`
- Star Wars survey data are from fivethirtyeight.com.
- https://github.com/fivethirtyeight/data/tree/master/star-wars-survey
- https://data.fivethirtyeight.com/
- Unless otherwise noted, five thirty eight data sets are available under the Creative Commons Attribution 4.0 International License, and the code is available under the MIT License.
- https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/
Load the survey data that deals with character's favorability rating.
```{r}
favorability_popularity_rating <-
read_csv("data/538_favorability_popularity.csv",
skip = 11)
```
# Raw Data
## Star Wars Charactesr
```{r show_dplyr-starwars-tibble}
starwars
```
## Favorability Data
I have transformed the *Five Thirty Eight* [data](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey). It's ready to merge with the other dataset.
```{r show_favorability-rating}
favorability_popularity_rating
```
# EDA
Exploratory Data Analysis gives me a general sense of the data. For example, what are the means, medians, distributions of each data variable.
## Skim
Skim the `dplyr::starwars` raw data.
```{r}
skim(starwars)
```
Skim the transformed favorability data gathered from fivethirtyeight.com.
```{r skim-joined}
skim(favorability_popularity_rating)
```
# Join Data
There are different types of joins (e.g. inner, left, right, full, etc.). See the `dplyr::join` [documentation](http://dplyr.tidyverse.org/reference/join.html) for a more complete explanation. Note that the join key plays a critical role in the success rate of the join. Joining on alphanumeric keys can be problematic due to complications such as diacritical marks, case sensitivity, and spaces. Ideally, match on consistent numeric keys.
```{r join-data}
sw_joined <- starwars %>%
left_join(favorability_popularity_rating, by = "name") %>%
select(1, 14, 2:14)
sw_joined %>%
arrange(desc(fav_rating))
```
### Skim Joined Data
```{r}
sw_joined %>%
drop_na(fav_rating) %>%
skim()
```
### anti_join
Joins that use a natural language key typically yield imprecise matching. Additional join functions can help you identify what didn't match. Refer to the multiple join functions: http://dplyr.tidyverse.org/reference/join.html
In this case we can use the `anti_join` function to determine what didn't match. The natural language variable (`name` -- a character vector) has multiple spellings and variations. Thus the following didn't match. Can you identify what caused the mismatches with the following names?
```{r}
favorability_popularity_rating %>% anti_join(starwars, by = "name")
```
### inner_join
Alternatively to the `left_join` (which doesn't loose data), we could use the `inner_join` to retain only data when there is a match. We could also use the results from the anti_join to edit the values in the matching key (`name`) and thereby force more key matches.
```{r}
starwars %>% inner_join(favorability_popularity_rating, by = "name")
```
# Visualize
Basic ggplot2 template
```r
ggplot(data = starwars, aes(x = height, y = fav_rating)) +
geom_----()
```
## Scatter Plot
Is a character's favorability dependent on their individual height?
```{r}
sw_joined %>%
drop_na(fav_rating) %>%
ggplot(aes(x = height, y = fav_rating)) +
geom_point()
```
## Histogram
What is the distribution of the birth years of all the characters?
```{r}
starwars %>%
ggplot(aes(x = birth_year)) +
geom_histogram()
```
What is the distribution of the character's height?
```{r}
starwars %>%
ggplot(aes(x = height)) +
geom_histogram()
```
## Box Plot
What is the distribution of the mass of three species: Human, Droid, Gungan?
```{r}
starwars %>%
drop_na(species, mass) %>%
filter(species == "Human" |
species == "Droid" |
species == "Gungan") %>%
ggplot(aes(x = species, y = mass)) +
geom_boxplot()
```
## Bar Plot
```{r}
ggplot(data = starwars) +
geom_bar(aes(fct_infreq(eye_color)))
```
There are at least two different barplot geom_s.
- [`geom_bar`](https://ggplot2.tidyverse.org/reference/geom_bar.html) - counts the cases (rows) in the dataframe variable and uses that count to display the bar heights.
- [`geom_col`](https://ggplot2.tidyverse.org/reference/geom_bar.html) - use `geom_col` when bar heights are already represented in the dataframe. For example, summary counts shows represented values that should be used for the bar height.
Sticking with geom_bar, below bar graph that uses more of ggplots features to generate a more sophisticated graph.
```{r}
ggplot(starwars) +
geom_bar(aes(fct_rev(fct_lump(fct_infreq(eye_color), n = 6))),
fill = "grey") +
coord_flip() +
geom_bar(data = starwars %>% filter(eye_color == "orange"),
aes(eye_color), fill = "orange") +
labs(x = "Eye Color", y = "",
title = "Eye Color of Starwars Characters",
subtitle = "Oranges eyes are uncommon.",
caption = "Data Source: dplyr::starwars")
```
## Other Aesthetics
Aside from identifying the x and y data vectors in the `aes()` function, there are several other aesthetics that you can manipulate: color, fill, lines, shape, text. See more at the [Aesthetic specifications documentation](https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#polygons).
For example, below we map color and size to variables within the data frame. Note that the legends for the `species` and `birth_year` variables are automatically generated by ggplot when variables are assigned within the aesthetic argument.
Next, use a fixed value outside the `aes()` function -- but inside the `geom_`. In this case, the `alpha` argument affects the opacity of all the plotted points -- one way to handle the problem of overplotting.
```{r}
starwars %>%
drop_na(mass, height, birth_year) %>%
filter(mass < 400,
species == "Human" |
species == "Droid" |
species == "Gungan" |
species == "Mirialan" |
species == "Ewok" |
species == "Wookiee") %>%
ggplot(aes(x = mass, y = height)) +
geom_point(aes(color = species, size = birth_year), alpha = 0.7)
```
# Session Info
For reproducibility, it's always good to document the session info.
```{r session_info}
sessioninfo::session_info()
```
# License
Creative Commons. CC By-NC https://creativecommons.org/licenses/by-nc/4.0/legalcode