class: center, middle, inverse, title-slide

# Visualising categorical data
## Introduction to Data Science with R and Tidyverse
### based on datasciencebox.org

---

layout: true

<div class="my-footer">
<span>
Introduction to Data Science with R and Tidyverse | Lukas Jürgensmeier, Matteo Fina, Jan Bischoff | based on <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

<!-- class: middle -->

<!-- # Recap -->

<!-- --- -->

<!-- ## Variables -->

<!-- - **Numerical** variables can be classified as **continuous** or **discrete** based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively. -->

<!-- - If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering. -->

<!-- --- -->

### Loan Data

Thousands of loans made through the Lending Club, which is a platform that allows individuals to lend to other individuals.

```r
library(tidyverse)  # provides %>%, select(), glimpse(), and ggplot()
library(openintro)  # provides the loans_full_schema dataset

loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 8
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2~
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, ~
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, ~
## $ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B~
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, ~
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000~
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M~
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, ~
```

---

class: middle

# Bar plot

---

## Bar plot

```r
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Segmented bar plot

```r
ggplot(loans, aes(x = homeownership,
*                 fill = grade)) +
  geom_bar()
```

<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

## Segmented bar plot

```r
ggplot(loans, aes(x = homeownership, fill = grade)) +
* geom_bar(position = "fill")
```

<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
Which bar plot is a more useful representation for visualizing the relationship between homeownership and grade?
]

.pull-left[
<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

.panelset[

.panel[.panel-name[Plot]
<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Code]

```r
*ggplot(loans, aes(y = homeownership, fill = grade)) +
  geom_bar(position = "fill") +
* labs(
*   x = "Proportion",
*   y = "Homeownership",
*   fill = "Grade",
*   title = "Grades of Lending Club loans",
*   subtitle = "and homeownership of lendee"
* )
```
]

]

---

class: middle

# Appendix: Relationships between numerical and categorical variables

We won't discuss the following slides, but they can serve as useful plotting inspiration.

---

## Violin plots

```r
ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()
```

<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

## Ridge plots

```r
library(ggridges)

ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) +
  geom_density_ridges(alpha = 0.5)
```

<img src="u2-d04-viz-cat_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />
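
---

## Ridge plots: a numerical companion

A minimal sketch, not part of the original slides: the per-grade distributions drawn in the ridge plot could also be summarised as numbers. It reads `loans_full_schema` directly so that it runs on its own, and the summary column names `n`, `median_amount` and `iqr_amount` are made up for illustration.

```r
library(openintro)  # provides the loans_full_schema dataset
library(dplyr)      # provides %>%, group_by(), summarise(), n()

# One row per grade: number of loans, median and IQR of the loan amount.
# Column names below are illustrative, not taken from the slides.
loan_summary <- loans_full_schema %>%
  group_by(grade) %>%
  summarise(
    n = n(),
    median_amount = median(loan_amount, na.rm = TRUE),
    iqr_amount = IQR(loan_amount, na.rm = TRUE)
  )

loan_summary
```

Reading the table next to the ridge plot may help check whether an apparent shift between grades reflects a difference in typical loan amount or mostly a difference in spread.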