class: center, middle, inverse, title-slide

# Data and visualisation
## Introduction to Data Science with R and Tidyverse
### based on datasciencebox.org

---
layout: true

<div class="my-footer">
<span>
Introduction to Data Science with R and Tidyverse | Lukas Jürgensmeier, Matteo Fina, Jan Bischoff | based on <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---
class: middle

# What is in a dataset?

---

## Dataset terminology

- Each row is an **observation**
- Each column is a **variable**

.small[

```r
starwars
```

```
## # A tibble: 87 x 14
##   name  height  mass hair_~1 skin_~2 eye_c~3 birth~4 sex   gender
##   <chr>  <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>
## 1 Luke~    172    77 blond   fair    blue       19   male  mascu~
## 2 C-3PO    167    75 <NA>    gold    yellow    112   none  mascu~
## 3 R2-D2     96    32 <NA>    white,~ red        33   none  mascu~
## 4 Dart~    202   136 none    white   yellow     41.9 male  mascu~
## 5 Leia~    150    49 brown   light   brown      19   fema~ femin~
## 6 Owen~    178   120 brown,~ light   blue       52   male  mascu~
## # ... with 81 more rows, 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names
## #   1: hair_color, 2: skin_color, 3: eye_color, 4: birth_year
```

]

---

## Luke Skywalker

![luke-skywalker](img/luke-skywalker.png)

---

## What's in the Star Wars data?
Take a `glimpse` at the data:

```r
glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth V~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 1~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, ~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, gr~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "lig~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", ~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, N~
## $ sex        <chr> "male", "none", "none", "male", "female", "m~
## $ gender     <chr> "masculine", "masculine", "masculine", "masc~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine",~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human",~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TI~
```

---

.question[
How many rows and columns does this dataset have?
What does each row represent?
What does each column represent?
]

```r
?starwars
```

<img src="img/starwars-help.png" width="60%" style="display: block; margin: auto;" />

---

.question[
How many rows and columns does this dataset have?
]

.pull-left[

```r
nrow(starwars) # number of rows
```

```
## [1] 87
```

```r
ncol(starwars) # number of columns
```

```
## [1] 14
```

```r
dim(starwars)  # dimensions (rows, columns)
```

```
## [1] 87 14
```

]

---
class: middle

# Exploratory data analysis

---

## What is EDA?

- **Exploratory data analysis (EDA)** is an approach to analysing data sets to summarize their main characteristics
- Often, this is **visual**, which is what we'll focus on first
- But we might also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis, which is what we'll focus on next

---

## Mass vs. height

.question[
How would you describe the **relationship between mass and height** of Star Wars characters?
What other variables would help us understand data points that don't follow the overall trend?
Who is the not-so-tall but really chubby character?
]

<img src="u2-d01-data-viz_files/figure-html/unnamed-chunk-7-1.png" width="50%" style="display: block; margin: auto;" />

---

## Jabba!

<img src="img/jabbaplot.png" width="80%" style="display: block; margin: auto;" />

---
class: middle

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst's mind than any other device."* (John Tukey)

- Data visualization is the creation and study of the visual representation of data
- There are many tools for visualizing data, and R is one of them
- There are many approaches/systems within R for making data visualizations; **ggplot2** is one of them, and that's what we're going to use

---

## ggplot2 `\(\in\)` tidyverse

.pull-left[
<img src="img/ggplot2-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
- **ggplot2** is tidyverse's data visualization package
- `gg` in "ggplot2" stands for Grammar of Graphics
- Inspired by the book **The Grammar of Graphics** by Leland Wilkinson
]

---

## Grammar of Graphics

.pull-left-narrow[
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic
]

.pull-right-wide[
<img src="img/grammar-of-graphics.png" width="100%" style="display: block; margin: auto;" />
]

.footnote[
Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)
]

---

## Mass vs. height

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

<img src="u2-d01-data-viz_files/figure-html/mass-height-1.png" width="50%" style="display: block; margin: auto;" />

---

.question[
- What are the functions doing the plotting?
- What is the dataset being plotted?
- Which variables map to which features (aesthetics) of the plot?
- What does the warning mean?<sup>+</sup>
]

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

```
## Warning: Removed 28 rows containing missing values
## (`geom_point()`).
```

.footnote[
<sup>+</sup>Warnings are suppressed on subsequent slides to save space
]

---

## Hello ggplot2!

.pull-left-wide[
- `ggplot()` is the main function in ggplot2
- Plots are constructed in layers
- Structure of the code for plots can be summarized as

```r
ggplot(data = [dataset],
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
```

- The ggplot2 package comes with the tidyverse

```r
library(tidyverse)
```

- For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)
]

---
class: middle

# Why do we visualize?

---

## Age at first kiss

.question[
Do you see anything out of the ordinary?
]

<img src="u2-d01-data-viz_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" />

---

## Facebook visits

.question[
How are people reporting lower vs. higher values of FB visits?
]

<img src="u2-d01-data-viz_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" />
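
---

## Revisiting the missing-values warning

The warning on the mass vs. height plot reports rows that `geom_point()` could not draw because `height` or `mass` is missing. The sketch below is one possible way to check that count by hand and to build the same scatterplot from complete rows only. It is not part of the original deck: it assumes the `starwars` data that ships with dplyr, and the names `n_incomplete`, `complete_cases` and `p` are introduced here purely for illustration.

```r
library(dplyr)
library(ggplot2)

# Count characters with a missing height or a missing mass:
# these are the observations geom_point() cannot place on the plot
n_incomplete <- starwars %>%
  filter(is.na(height) | is.na(mass)) %>%
  nrow()

n_incomplete

# Keep only rows where both variables are present,
# so the plot is built without dropping anything silently
complete_cases <- starwars %>%
  filter(!is.na(height), !is.na(mass))

p <- ggplot(data = complete_cases,
            mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(x = "Height (cm)", y = "Mass (kg)")

p
```

If the installed data match the version shown in these slides, `n_incomplete` should equal the number of rows reported in the warning. Filtering first makes the removal of incomplete observations an explicit step in the analysis instead of a side effect of plotting.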