class: center, middle, title-slide

# themis: dealing with imbalanced data by using synthetic oversampling
## useR2020
### Emil Hvitfeldt

---

## Motivating Fictional Scenario

You work at a healthcare startup

The company's mission is to provide preventive care to lower overall medical costs

A cancer screening is available and you have been tasked with developing a model that identifies customers who would benefit from it

---

## Modeling with tidymodels

```r
customers_data <- read_csv("data/customers.csv")
```

--

```r
library(tidymodels)
```

---

## Modeling with tidymodels

```r
customers_data <- read_csv("data/customers.csv")
```

```r
library(tidymodels)

[...]

# Use company modeling template

[...]
```

---

## Modeling with tidymodels

```r
customers_data <- read_csv("data/customers.csv")
```

```r
library(tidymodels)

[...]

# Use company modeling template

[...]

model_results %>%
  collect_metrics()
## # A tibble: 1 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.938    10 0.00570
```

---

# Confusion Matrix

--

<img src="index_files/figure-html/unnamed-chunk-11-1.png" width="700px" style="display: block; margin: auto;" />

---

# Class distribution

<img src="index_files/figure-html/unnamed-chunk-12-1.png" width="700px" style="display: block; margin: auto;" />

---

![:scale 90%](cartoons/non-proportional.png)

---

![:scale 90%](cartoons/proportional.png)

---

# How to deal with unbalanced data

---

# How to deal with unbalanced data

- Use weights

---

# How to deal with unbalanced data

- Use weights
- Ensemble Methods

---

# How to deal with unbalanced data

- Use weights
- Ensemble Methods
- Over-sampling

---

# How to deal with unbalanced data

- Use weights
- Ensemble Methods
- Over-sampling
- Under-sampling

---

# How to deal with unbalanced data

- Use weights
- Ensemble Methods
- **Over-sampling**
- **Under-sampling**

---

# Definitions

### Over-sampling

Creating additional observations for minority classes

### Under-sampling

Removing observations from majority classes

---

# Disclaimer

All visualizations are done in two dimensions

But the methods generalize to higher dimensions

Similarly, most examples will only have two classes

---

.hidden[
## Title
]

<img src="index_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />

---

## Randomly remove samples from majority

<img src="index_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />

---

## Randomly remove samples from majority

<img src="index_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />

---

## Randomly remove samples from majority

<img src="index_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" />

---

# Over-sampling

We want to create additional points

But how should they be created?

---

# Over-sampling

We want to create additional points

But how should they be created?

- Duplicates of existing points

---

# Over-sampling

We want to create additional points

But how should they be created?

- Duplicates of existing points
- Generate points around existing points

---

# Over-sampling

We want to create additional points

But how should they be created?

- Duplicates of existing points
- Generate points around existing points
- Create a generative model and sample points

---

# Over-sampling

We want to create additional points

But how should they be created?
- Duplicates of existing points
- **Generate points around existing points**
- Create a generative model and sample points

---

# SMOTE

--

- **S**ynthetic
- **M**inority
- **O**ver-sampling
- **TE**chnique

--

SMOTE is a clever technique that works by generating new points between existing points

---

.pull-left[

]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[

]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
**To SMOTE a point**
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
**To SMOTE a point**

1. Select a **.blue[point]**
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
**To SMOTE a point**

1. Select a point
1. Find n **nearest neighbors** inside the same class (n = 5)
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
**To SMOTE a point**

1. Select a point
1. Find n nearest neighbors inside the same class (n = 5)
1. Randomly pick 1 **neighbor**
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-23-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
**To SMOTE a point**

1. Select a point
1. Find n nearest neighbors inside the same class (n = 5)
1. Randomly pick 1 neighbor
1. Generate 1 **.blue[point]** randomly along the line between the two
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-24-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
**To SMOTE a point**

1. Select a point
1. Find n nearest neighbors inside the same class (n = 5)
1. Randomly pick 1 neighbor
1. Generate 1 **.blue[point]** randomly along the line between the two
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-25-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
# SMOTE

To balance the classes, we simply generate `majority_count - minority_count` points
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-26-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
# Borderline SMOTE

Points with only their own class as neighbors are "safe".

Points with only other classes as neighbors are "lost".

If more than half of a point's neighbors come from a different class, it is labeled "danger".
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
# Borderline SMOTE

Only create new points around "danger" points
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
# Borderline SMOTE

Variant: generate between a point and any of its neighbors, not just those of its own class
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
# ADASYN

Points are selected in proportion to how many of their neighbors are from a different class
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Implementation

I need methods that:

- Can handle more than 2 classes,
- Are fast and have a low memory footprint, and
- Can generate exactly N points

---

# tidymodels/themis

![](images/github-page.png)

---

```r
library(recipes)
library(themis)
library(modeldata)

data(credit_data)

# Class counts before over-sampling
sort(table(credit_data$Status, useNA = "always"))
```

```
## 
## <NA>  bad good 
##    0 1254 3200
```

```r
ds_rec <- recipe(Status ~ Age + Income + Assets, data = credit_data) %>%
  # SMOTE cannot handle missing values, so impute first
  # (newer versions of recipes call this step step_impute_mean())
  step_meanimpute(all_predictors()) %>%
  step_smote(Status) %>%
  prep()

# Class counts after over-sampling
sort(table(juice(ds_rec)$Status, useNA = "always"))
```

```
## 
## <NA>  bad good 
##    0 3200 3200
```

---

![](images/references.png)

---

![](images/tidyverse.png)

https://www.tidyverse.org/blog/2020/02/themis-0-1-0/

---

class: center, middle

# Thank you!

### [EmilHvitfeldt](https://github.com/EmilHvitfeldt/)
### [@Emil_Hvitfeldt](https://twitter.com/Emil_Hvitfeldt)
### [emilhvitfeldt](https://linkedin.com/in/emilhvitfeldt/)
### [www.hvitfeldt.me](https://www.hvitfeldt.me)

Slides created via the R package [xaringan](https://github.com/yihui/xaringan).
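
---

# Appendix: a SMOTE sketch in base R

The "To SMOTE a point" steps from earlier in the deck can be written out in a few lines of base R. This is only a rough sketch and not how themis implements `step_smote()`: the function names `smote_one()` and `smote()`, the use of Euclidean distance, the random choice of starting points, and the toy data are all assumptions made for illustration.

```r
# Illustrative sketch only; `smote_one()` and `smote()` are hypothetical helpers.
smote_one <- function(minority, i, k = 5) {
  # minority: numeric matrix with one row per minority-class observation
  # i: row index of the selected point
  point <- minority[i, ]

  # Euclidean distance from the selected point to every minority point
  dists <- sqrt(rowSums(sweep(minority, 2, point)^2))
  dists[i] <- Inf  # never use the point as its own neighbor

  # Indices of the k nearest neighbors inside the same class
  neighbors <- order(dists)[seq_len(min(k, nrow(minority) - 1))]

  # Randomly pick one of the neighbors
  j <- neighbors[sample.int(length(neighbors), 1)]

  # Generate one point at a random position along the connecting line
  gap <- runif(1)
  point + gap * (minority[j, ] - point)
}

smote <- function(minority, n_new, k = 5) {
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority))
  for (row in seq_len(n_new)) {
    # Assumption: starting points are chosen uniformly at random
    i <- sample.int(nrow(minority), 1)
    synthetic[row, ] <- smote_one(minority, i, k)
  }
  synthetic
}

set.seed(2020)
minority <- matrix(rnorm(20), ncol = 2)  # 10 made-up minority points
majority_count <- 25                      # made-up majority class size

# Balance the classes: majority_count - minority_count new points
synthetic <- smote(minority, n_new = majority_count - nrow(minority))
dim(synthetic)  # 15 rows, 2 columns
```

Because every synthetic point is a convex combination of two existing minority points, the new points never fall outside the range already covered by the minority class.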