class: center, middle, inverse, title-slide

# Tidy data
## Introduction to Data Science with R and Tidyverse
### based on datasciencebox.org

---

layout: true

<div class="my-footer">
<span>
Introduction to Data Science with R and Tidyverse | Lukas Jürgensmeier, Matteo Fina, Jan Bischoff | based on <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Tidy data

> *Happy families are all alike; every unhappy family is unhappy in its own way.*
>
> — Leo Tolstoy

--

.pull-left[
**Characteristics of tidy data:**

- Each variable forms a *column*.
- Each observation forms a *row*.
- Each type of observational unit forms a *table*.
]

--

.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

.question[
What makes this data not tidy?
]

<br>

<img src="img/hiv-est-prevalence-15-49.png" width="70%" style="display: block; margin: auto;" />

.footnote[
Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)
]

---

## Displaying vs. summarising data

.panelset[

.panel[.panel-name[Output]

.pull-left[
```
## # A tibble: 87 x 3
##   name           height  mass
##   <chr>           <int> <dbl>
## 1 Luke Skywalker    172    77
## 2 C-3PO             167    75
## 3 R2-D2              96    32
## 4 Darth Vader       202   136
## 5 Leia Organa       150    49
## 6 Owen Lars         178   120
## # ... with 81 more rows
```
]

.pull-right[
```
## # A tibble: 3 x 2
##   gender    avg_ht
##   <chr>      <dbl>
## 1 feminine    165.
## 2 masculine   177.
## 3 <NA>        181.
```
]

]

.panel[.panel-name[Code]

.pull-left[
```r
starwars %>%
  select(name, height, mass)
```
]

.pull-right[
```r
starwars %>%
  group_by(gender) %>%
  summarize(
    avg_ht = mean(height, na.rm = TRUE)
  )
```
]

]

]
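
---

## Reshaping towards tidy data

The three characteristics of tidy data can be made concrete with a small reshaping sketch. It assumes a wide table with one row per country and one column per year, which is one common way a table can fail the "each variable forms a column" rule. The object names (`prevalence_wide`, `prevalence_long`), the country labels, and all values are hypothetical and invented for illustration; they are not Gapminder figures.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per country, one column per year.
# All values are made up for illustration only.
prevalence_wide <- tribble(
  ~country,    ~`2008`, ~`2009`,
  "Country A",     0.1,     0.2,
  "Country B",     1.5,     1.4
)

# Move the year columns into a `year` variable and a `prevalence` variable,
# so that each row holds exactly one country-year observation.
prevalence_long <- prevalence_wide %>%
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "prevalence",
    names_transform = list(year = as.integer)
  )

prevalence_long
```

In the long version, `year` is a variable in its own column rather than being spread across column names, which is what lets verbs such as `group_by()` and `summarize()` work on it directly.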