
Data and visualisation



Introduction to Data Science with R and Tidyverse

based on datasciencebox.org

1 / 21

What is in a dataset?

2 / 21

Dataset terminology

  • Each row is an observation
  • Each column is a variable
starwars
## # A tibble: 87 x 14
## name height mass hair_~1 skin_~2 eye_c~3 birth~4 sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white,~ red 33 none mascu~
## 4 Dart~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen~ 178 120 brown,~ light blue 52 male mascu~
## # ... with 81 more rows, 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names
## # 1: hair_color, 2: skin_color, 3: eye_color, 4: birth_year
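To make the row/column terminology concrete, here is a minimal sketch that pulls one observation and one variable out of `starwars`. The object names `first_character` and `heights` are illustrative and not part of the original slides; `dplyr` is loaded directly here, though `library(tidyverse)` would load it as well.

```r
library(dplyr)  # provides the starwars tibble, slice() and pull()

# One observation: the first row, i.e. a single character
first_character <- slice(starwars, 1)

# One variable: the height column, extracted as a plain vector
heights <- pull(starwars, height)

first_character$name  # the name recorded for that observation
length(heights)       # one height value (possibly NA) per observation
```

Each row keeps all 14 variables for one character, while each column holds one value per character.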
3 / 21

Luke Skywalker

4 / 21

What's in the Star Wars data?

Take a glimpse at the data:

glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth V~
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 1~
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, ~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, gr~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "lig~
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", ~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, N~
## $ sex <chr> "male", "none", "none", "male", "female", "m~
## $ gender <chr> "masculine", "masculine", "masculine", "masc~
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine",~
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human",~
## $ films <list> <"The Empire Strikes Back", "Revenge of the~
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <~
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TI~
5 / 21

How many rows and columns does this dataset have? What does each row represent? What does each column represent?

?starwars

6 / 21

How many rows and columns does this dataset have?

nrow(starwars) # number of rows
## [1] 87
ncol(starwars) # number of columns
## [1] 14
dim(starwars) # dimensions (rows, columns)
## [1] 87 14
7 / 21

Exploratory data analysis

8 / 21

What is EDA?

  • Exploratory data analysis (EDA) is an approach to analysing data sets to summarize their main characteristics
  • Often, this is visual — this is what we'll focus on first
  • But we might also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis — this is what we'll focus on next
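As a rough illustration of the summary-statistics side of EDA, the sketch below computes a few numbers for `height` and `mass`. The choice of statistics and the name `height_mass_summary` are assumptions made for this example, not something prescribed by the slides.

```r
library(dplyr)  # provides starwars, summarise() and the %>% pipe

height_mass_summary <- starwars %>%
  summarise(
    n_characters   = n(),                          # number of observations
    mean_height    = mean(height, na.rm = TRUE),   # ignore missing heights
    median_mass    = median(mass, na.rm = TRUE),   # ignore missing masses
    n_missing_mass = sum(is.na(mass))              # how many masses are unknown
  )

height_mass_summary
```

`na.rm = TRUE` is needed because both columns contain missing values; without it, `mean()` and `median()` return `NA`.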
9 / 21

Mass vs. height

How would you describe the relationship between mass and height of Star Wars characters? What other variables would help us understand data points that don't follow the overall trend? Which character is not particularly tall but far heavier than the rest?

10 / 21

Jabba!
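One possible way to confirm the outlier from the data rather than from the plot is to ask for the row with the largest mass. This is a sketch that assumes a dplyr version providing `slice_max()` (1.0.0 or later); the name `heaviest` is illustrative.

```r
library(dplyr)  # provides starwars, slice_max() and select()

# Keep the observation(s) with the largest recorded mass;
# rows with a missing mass are not ranked ahead of known values
heaviest <- starwars %>%
  slice_max(mass, n = 1) %>%
  select(name, height, mass)

heaviest
```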

11 / 21

Data visualization

12 / 21

Data visualization

"The simple graph has brought more information to the data analyst's mind than any other device." — John Tukey

  • Data visualization is the creation and study of the visual representation of data
  • Many tools for visualizing data — R is one of them
  • Many approaches/systems within R for making data visualizations — ggplot2 is one of them, and that's what we're going to use
13 / 21

ggplot2 ∈ tidyverse

  • ggplot2 is tidyverse's data visualization package
  • gg in "ggplot2" stands for Grammar of Graphics
  • Inspired by the book The Grammar of Graphics by Leland Wilkinson
14 / 21

Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

Source: BloggoType

15 / 21

Mass vs. height

ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")

16 / 21
  • What are the functions doing the plotting?
  • What is the dataset being plotted?
  • Which variables map to which features (aesthetics) of the plot?
  • What does the warning mean?+
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
## Warning: Removed 28 rows containing missing values
## (`geom_point()`).

+ Warnings are suppressed on subsequent slides to save space
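The warning says that some rows could not be drawn because `height` or `mass` is missing. A quick way to check this, sketched below, is to count the rows where either variable is `NA`. The name `incomplete` is illustrative, and the exact count may differ from the 28 reported above if your installed dplyr ships a different version of `starwars`.

```r
library(dplyr)  # provides starwars and filter()

# Rows that geom_point() cannot place: x or y is missing
incomplete <- starwars %>%
  filter(is.na(height) | is.na(mass))

nrow(incomplete)  # compare with the number reported in the warning
```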

17 / 21

Hello ggplot2!

  • ggplot() is the main function in ggplot2
  • Plots are constructed in layers
  • Structure of the code for plots can be summarized as
ggplot(data = [dataset],
       mapping = aes(x = [x-variable], y = [y-variable])) +
  geom_xxx() +
  other options
  • The ggplot2 package comes with the tidyverse
library(tidyverse)
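To show what "constructed in layers" can look like in practice, the sketch below builds the mass vs. height plot step by step, storing each stage in a variable. The names `base_plot`, `scatter` and `labelled` are illustrative, and `na.rm = TRUE` is used here only to drop missing values silently.

```r
library(ggplot2)  # plotting functions
library(dplyr)    # provides the starwars tibble

# Step 1: data and aesthetic mappings only, no geometry yet
base_plot <- ggplot(data = starwars, mapping = aes(x = height, y = mass))

# Step 2: add a layer of points
scatter <- base_plot + geom_point(na.rm = TRUE)

# Step 3: add labels on top of the existing plot
labelled <- scatter +
  labs(title = "Mass vs. height", x = "Height (cm)", y = "Mass (kg)")

labelled  # printing the object draws the plot
```

Printing `base_plot` on its own gives empty axes, which makes it easy to see what each `+` contributes.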
18 / 21

Why do we visualize?

19 / 21

Age at first kiss

Do you see anything out of the ordinary?

20 / 21

Facebook visits

How are people reporting lower vs. higher values of FB visits?

21 / 21
