class: center, middle, inverse, title-slide

# Data types
## Introduction to Data Science with R and Tidyverse
### based on datasciencebox.org

---
layout: true

<div class="my-footer">
<span>
Introduction to Data Science with R and Tidyverse | Lukas Jürgensmeier, Matteo Fina, Jan Bischoff | based on <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---
class: middle

# Why should you care about data types?

---

## Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

```r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 x 3
##   name           number_of_cats handedness
##   <chr>          <chr>          <chr>
## 1 Bernice Warren 0              left
## 2 Woodrow Stone  0              left
## 3 Willie Bass    1              left
## 4 Tyrone Estrada 3              left
## 5 Alex Daniels   3              left
## 6 Jane Bates     2              left
## # ... with 54 more rows
```

---

## Oh why won't you work?!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `summarise()`.
## i In argument: `mean_cats = mean(number_of_cats)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

```r
?mean
```

<img src="img/mean-help.png" width="75%" style="display: block; margin: auto;" />

---

## Oh why won't you still work??!!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning: There was 1 warning in `summarise()`.
## i In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

.question[
What is the type of the `number_of_cats` variable?
]

```r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will~
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", ~
## $ handedness     <chr> "left", "left", "left", "left", "left", ~
```

---

## Let's take another look

Check out the responses from "Ginger Clark" and "Doug Bass"

.small[
]

---

## Sometimes you might need to babysit your respondents

.midi[
```r
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `mutate()`.
## i In argument: `number_of_cats = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.833
```
]

---

## You always need to respect data types

```r
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.833
```

---

## Now that we know what we're doing...

```r
*cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )
```

---

## Moral of the story

- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.

---
class: middle

.light-blue[...now that we have a good motivation for]

.light-blue[learning about data types in R]

<br>

.large[
.light-blue[let's learn about data types in R!]
]

---
class: middle

# Data types

---

## Data types in R

- **logical**
- **double**
- **integer**
- **character**
- and some more, but we won't be focusing on those

---

## Logical & character

.pull-left[
**logical** - boolean values `TRUE` and `FALSE`

```r
typeof(TRUE)
```

```
## [1] "logical"
```
]
.pull-right[
**character** - character strings

```r
typeof("hello")
```

```
## [1] "character"
```
]

---

## Double & integer

.pull-left[
**double** - floating point numerical values (default numerical type)

```r
typeof(1.335)
```

```
## [1] "double"
```

```r
typeof(7)
```

```
## [1] "double"
```
]
.pull-right[
**integer** - integer numerical values (indicated with an `L`)

```r
typeof(7L)
```

```
## [1] "integer"
```

```r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Concatenation

Vectors can be constructed using the `c()` function.

```r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

```r
c(c("hi", "hello"), c("bye", "jello"))
```

```
## [1] "hi"    "hello" "bye"   "jello"
```

---

## Converting between types with intention...

.pull-left[
```r
x <- 1:3
x
```

```
## [1] 1 2 3
```

```r
typeof(x)
```

```
## [1] "integer"
```
]

--

.pull-right[
```r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

```r
typeof(y)
```

```
## [1] "character"
```
]

---

## Converting between types with intention...

.pull-left[
```r
x <- c(TRUE, FALSE)
x
```

```
## [1]  TRUE FALSE
```

```r
typeof(x)
```

```
## [1] "logical"
```
]

--

.pull-right[
```r
y <- as.numeric(x)
y
```

```
## [1] 1 0
```

```r
typeof(y)
```

```
## [1] "double"
```
]

---

## Converting between types without intention...

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!
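
A rough sketch of the rule at work here, assuming the usual ordering logical < integer < double < character for these four types: the result of `c()` takes the most flexible type among its inputs. The name `coerced_types` is made up for this illustration.

```r
# Illustration only: mix two types in c() and ask typeof() what R settled on.
coerced_types <- c(
  logical_integer   = typeof(c(TRUE, 1L)),  # logical gives way to integer
  integer_double    = typeof(c(1L, 1.5)),   # integer gives way to double
  double_character  = typeof(c(1.5, "a")),  # double gives way to character
  logical_character = typeof(c(TRUE, "a"))  # character is the most flexible of the four
)
coerced_types
```
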
.pull-left[
```r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```
]

--

.pull-right[
```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```

```r
c(2L, "two")
```

```
## [1] "2"   "two"
```
]

---

## Explicit vs. implicit coercion

Let's give formal names to what we've seen so far:

--

- **Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()`

--

- **Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector

---
class: middle

# Special values

---

## Special values

- `NA`: Not available
- `NaN`: Not a number
- `Inf`: Positive infinity
- `-Inf`: Negative infinity

--

.pull-left[
```r
pi / 0
```

```
## [1] Inf
```

```r
0 / 0
```

```
## [1] NaN
```
]
.pull-right[
```r
1/0 - 1/0
```

```
## [1] NaN
```

```r
1/0 + 1/0
```

```
## [1] Inf
```
]

<!-- --- -->
<!-- ## `NA`s are special ❄️s -->
<!-- ```{r} -->
<!-- x <- c(1, 2, 3, 4, NA) -->
<!-- ``` -->
<!-- ```{r} -->
<!-- mean(x) -->
<!-- mean(x, na.rm = TRUE) -->
<!-- summary(x) -->
<!-- ``` -->
<!-- --- -->
<!-- ## `NA`s are logical -->
<!-- R uses `NA` to represent missing values in its data structures. -->
<!-- ```{r} -->
<!-- typeof(NA) -->
<!-- ``` -->
<!-- --- -->
<!-- ## Mental model for `NA`s -->
<!-- - Unlike `NaN`, `NA`s are genuinely unknown values -->
<!-- - But that doesn't mean they can't function in a logical way -->
<!-- - Let's think about why `NA`s are logical... -->
<!-- -- -->
<!-- .question[ -->
<!-- Why do the following give different answers? -->
<!-- ] -->
<!-- .pull-left[ -->
<!-- ```{r} -->
<!-- # TRUE or NA -->
<!-- TRUE | NA -->
<!-- ``` -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- ```{r} -->
<!-- # FALSE or NA -->
<!-- FALSE | NA -->
<!-- ``` -->
<!-- ] -->
<!-- `\(\rightarrow\)` See next slide for answers...
-->
<!-- --- -->
<!-- - `NA` is unknown, so it could be `TRUE` or `FALSE` -->
<!-- .pull-left[ -->
<!-- .midi[ -->
<!-- - `TRUE | NA` -->
<!-- ```{r} -->
<!-- TRUE | TRUE # if NA was TRUE -->
<!-- TRUE | FALSE # if NA was FALSE -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- .midi[ -->
<!-- - `FALSE | NA` -->
<!-- ```{r} -->
<!-- FALSE | TRUE # if NA was TRUE -->
<!-- FALSE | FALSE # if NA was FALSE -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->
<!-- - Doesn't make sense for mathematical operations -->
<!-- - Makes sense in the context of missing data -->
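
---

## Checking for special values

A minimal sketch of how the special values listed earlier can be detected with base R predicates. The vector `x` is invented for this illustration, and the comments describe what each call flags.

```r
# Hypothetical vector holding one ordinary number and each special value.
x <- c(1, NA, NaN, Inf, -Inf)

is.na(x)        # flags NA and also NaN
is.nan(x)       # flags only NaN
is.infinite(x)  # flags Inf and -Inf
is.finite(x)    # flags only the ordinary number 1
```

Since `is.na()` is also `TRUE` for `NaN`, use `is.nan()` when the two need to be told apart.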