recipes for Dummies

This short post can be viewed as an unofficial appendix to Grant McDermott’s terrific lecture on “Regression analysis in R” (go read it!). In particular, it is meant to extend the “Dummy variables” section of that lecture by introducing you to the recipes package, authored by Max Kuhn and Hadley Wickham.

The recipes Package

recipes basically provides a “tidy” approach to data preprocessing. Though recipes’ true greatness reveals itself in the “feature engineering” stage of building machine learning models, I find it extremely useful even for the simple task of generating dummy variables before running a linear regression.

The approach of recipes, as its name hints, is related to the process of cooking (or baking…). Your variables are the ingredients, and recipes’ collection of step_{X} functions defines what you want to do with your ingredients. If you follow the recipe’s instructions carefully, you will end up with a new (and tasty) data frame that includes the new and transformed variables you need.

Generating dummies using recipes

In this tutorial, we will focus on one recipes function called step_dummy that makes the task of generating dummies a breeze.

We start by loading the tidyverse and recipes packages:

library(tidyverse)
library(recipes)

Like Grant, we’ll be working with the starwars data frame.

starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke~    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>  
##  4 Dart~    202   136 none       white      yellow          41.9 male  
##  5 Leia~    150    49 brown      light      brown           19   female
##  6 Owen~    178   120 brown, gr~ light      blue            52   male  
##  7 Beru~    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg~    183    84 black      light      brown           24   male  
## 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

Let’s get down to cooking. We first need to prepare our data frame. In this case, we will use Grant’s humans data frame.

humans <- starwars %>% 
  filter(species == "Human") %>%
  select(name:species)
humans
## # A tibble: 35 x 10
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke~    172    77 blond      fair       blue            19   male  
##  2 Dart~    202   136 none       white      yellow          41.9 male  
##  3 Leia~    150    49 brown      light      brown           19   female
##  4 Owen~    178   120 brown, gr~ light      blue            52   male  
##  5 Beru~    165    75 brown      light      blue            47   female
##  6 Bigg~    183    84 black      light      brown           24   male  
##  7 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male  
##  8 Anak~    188    84 blond      fair       blue            41.9 male  
##  9 Wilh~    180    NA auburn, g~ fair       blue            64   male  
## 10 Han ~    180    80 brown      fair       brown           29   male  
## # ... with 25 more rows, and 2 more variables: homeworld <chr>,
## #   species <chr>

Now it is time to define the ingredients of our recipe, which are basically the variables in humans:

humans_rec <- humans %>% 
  recipe(mass ~ .)

summary(humans_rec)
## # A tibble: 10 x 4
##    variable   type    role      source  
##    <chr>      <chr>   <chr>     <chr>   
##  1 name       nominal predictor original
##  2 height     numeric predictor original
##  3 hair_color nominal predictor original
##  4 skin_color nominal predictor original
##  5 eye_color  nominal predictor original
##  6 birth_year numeric predictor original
##  7 gender     nominal predictor original
##  8 homeworld  nominal predictor original
##  9 species    nominal predictor original
## 10 mass       numeric outcome   original

Note that we’ve defined mass as our “outcome” variable, and the rest of the variables are defined as “predictors” (this is what ML folks call dependent and independent variables).
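To make the roles concrete, here is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not taken from starwars). Whatever sits on the left-hand side of the formula is tagged as the outcome, and everything else becomes a predictor.

```r
library(recipes)

# Hypothetical toy data standing in for `humans`; values are illustrative only
toy <- data.frame(
  height = c(172, 202, 150, 178),
  mass   = c(77, 136, 49, 120),
  gender = factor(c("male", "male", "female", "male"))
)

# The left-hand side of the formula (mass) becomes the outcome;
# the dot expands to all remaining columns, which become predictors
toy_rec <- recipe(mass ~ ., data = toy)

# summary() on a recipe returns one row per variable with its role
roles <- summary(toy_rec)
roles[, c("variable", "role")]
```

Swapping the formula for `~ .` (no left-hand side) should tag every column as a predictor, which is what the single-pipe version later in this post relies on.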

In the next step, we will write down our recipe for our variables (yeah. I know. Recipe for humans. Yuck. I blame Grant for choosing this data frame identifier…). Each step in the recipe contains instructions about what to do with some or all of our variables.

In the following example, we will use step_dummy to generate numeric (zero-one) columns for the categories of hair_color and skin_color. Then, we will use the prep function to train the recipe on the humans data frame, that is, to let it learn which categories each variable actually contains.

humans_cell <- humans_rec %>% 
  step_dummy(skin_color, hair_color) %>%
  prep(training = humans)
summary(humans_cell)
## # A tibble: 22 x 4
##    variable         type    role      source  
##    <chr>            <chr>   <chr>     <chr>   
##  1 name             nominal predictor original
##  2 height           numeric predictor original
##  3 eye_color        nominal predictor original
##  4 birth_year       numeric predictor original
##  5 gender           nominal predictor original
##  6 homeworld        nominal predictor original
##  7 species          nominal predictor original
##  8 mass             numeric outcome   original
##  9 skin_color_fair  numeric predictor derived 
## 10 skin_color_light numeric predictor derived 
## # ... with 12 more rows

Note that the recipe now holds 22 variable definitions instead of 10: skin_color and hair_color are gone, and 14 new dummies have taken their place (a net gain of 12). The names of the new dummies are straightforward. For example, in the 9th row, you can find skin_color_fair, the dummy for skin_color == "fair".

Calling the juice() function generates a new data frame according to our predefined recipe.

humans_juiced <- juice(humans_cell)
humans_juiced
## # A tibble: 35 x 22
##    name  height eye_color birth_year gender homeworld species  mass
##    <fct>  <int> <fct>          <dbl> <fct>  <fct>     <fct>   <dbl>
##  1 Luke~    172 blue            19   male   Tatooine  Human      77
##  2 Dart~    202 yellow          41.9 male   Tatooine  Human     136
##  3 Leia~    150 brown           19   female Alderaan  Human      49
##  4 Owen~    178 blue            52   male   Tatooine  Human     120
##  5 Beru~    165 blue            47   female Tatooine  Human      75
##  6 Bigg~    183 brown           24   male   Tatooine  Human      84
##  7 Obi-~    182 blue-gray       57   male   Stewjon   Human      77
##  8 Anak~    188 blue            41.9 male   Tatooine  Human      84
##  9 Wilh~    180 blue            64   male   Eriadu    Human      NA
## 10 Han ~    180 brown           29   male   Corellia  Human      80
## # ... with 25 more rows, and 14 more variables: skin_color_fair <dbl>,
## #   skin_color_light <dbl>, skin_color_pale <dbl>, skin_color_tan <dbl>,
## #   skin_color_white <dbl>, hair_color_auburn..grey <dbl>,
## #   hair_color_auburn..white <dbl>, hair_color_black <dbl>,
## #   hair_color_blond <dbl>, hair_color_brown <dbl>,
## #   hair_color_brown..grey <dbl>, hair_color_grey <dbl>,
## #   hair_color_none <dbl>, hair_color_white <dbl>
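A side note on juice(): as far as I know, more recent releases of recipes treat juice() as superseded and recommend bake() with new_data = NULL to get the processed training data back. The sketch below assumes such a recent version and uses a small made-up data frame (toy is hypothetical).

```r
library(recipes)

# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  mass       = c(77, 136, 49, 120, 75),
  skin_color = factor(c("fair", "white", "light", "light", "light"))
)

# Define the recipe, add the dummy step, and train it on the toy data
toy_prepped <- recipe(mass ~ ., data = toy) %>%
  step_dummy(skin_color) %>%
  prep(training = toy)

# new_data = NULL asks for the processed training set,
# which is the job juice() does in this post
baked <- bake(toy_prepped, new_data = NULL)
baked
```

The same bake() call also accepts a fresh data frame through new_data, which is how a trained recipe would be applied to data it has not seen before.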

Done! Now, let’s take a closer look at our new skin_color dummies:

humans_juiced %>% 
  select(starts_with("skin_color"))
## # A tibble: 35 x 5
##    skin_color_fair skin_color_light skin_color_pale skin_color_tan
##              <dbl>            <dbl>           <dbl>          <dbl>
##  1               1                0               0              0
##  2               0                0               0              0
##  3               0                1               0              0
##  4               0                1               0              0
##  5               0                1               0              0
##  6               0                1               0              0
##  7               1                0               0              0
##  8               1                0               0              0
##  9               1                0               0              0
## 10               1                0               0              0
## # ... with 25 more rows, and 1 more variable: skin_color_white <dbl>

As you can see, instead of skin_color we now have five zero-one numeric (<dbl>) columns, each corresponding to a specific category of skin_color. The remaining category, “dark”, is set as the base category and gets no column of its own. Note that unless instructed otherwise, step_dummy results in \(C-1\) dummies, where \(C\) is the number of categories. That is, step_dummy excludes one category by default.
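If you do want one column per category, step_dummy has a one_hot argument for that. Here is a minimal sketch comparing the two behaviours on a made-up factor with three categories (toy is hypothetical, not part of the post's data).

```r
library(recipes)

# Hypothetical toy data with C = 3 categories
toy <- data.frame(
  skin_color = factor(c("fair", "white", "light", "light", "fair"))
)

# Default behaviour: C - 1 dummies, the first level ("fair") is the base category
default_dummies <- recipe(~ skin_color, data = toy) %>%
  step_dummy(skin_color) %>%
  prep() %>%
  juice()

# one_hot = TRUE keeps a column for every category, so we get C dummies
one_hot_dummies <- recipe(~ skin_color, data = toy) %>%
  step_dummy(skin_color, one_hot = TRUE) %>%
  prep() %>%
  juice()

ncol(default_dummies)
ncol(one_hot_dummies)
```

Keep in mind that feeding all \(C\) dummies together with an intercept into a linear regression produces perfect collinearity, which is why dropping one category is a sensible default for that use case.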

NOTE: Unlike lm(), which automatically handles factor variables for you, with most machine learning algorithms it is well advised to work with numeric columns as input. As we just saw, recipes was built with this feature in mind.
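For comparison, here is a small base R sketch of what "automatically handles factor variables" means for lm(). The data frame is made up (toy is hypothetical); model.matrix() is used only to peek at the design matrix that lm() builds behind the scenes.

```r
# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  mass       = c(77, 136, 49, 120, 75, 84),
  skin_color = factor(c("fair", "white", "light", "light", "light", "fair"))
)

# lm() expands the factor into dummies on its own, dropping the first level
fit <- lm(mass ~ skin_color, data = toy)
coef_names <- names(coef(fit))
coef_names

# model.matrix() shows the zero-one columns lm() created internally
X <- model.matrix(mass ~ skin_color, data = toy)
X
```

The dummies exist only inside the model object here, whereas step_dummy hands them back as ordinary columns that any other modelling function can consume.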

Thanks to the pipe operator we can do all of the above in a single command:

humans_juiced <- humans %>% 
  recipe( ~ .) %>% 
  step_dummy(hair_color, skin_color) %>%
  prep() %>% 
  juice()
humans_juiced
## # A tibble: 35 x 22
##    name  height  mass eye_color birth_year gender homeworld species
##    <fct>  <int> <dbl> <fct>          <dbl> <fct>  <fct>     <fct>  
##  1 Luke~    172    77 blue            19   male   Tatooine  Human  
##  2 Dart~    202   136 yellow          41.9 male   Tatooine  Human  
##  3 Leia~    150    49 brown           19   female Alderaan  Human  
##  4 Owen~    178   120 blue            52   male   Tatooine  Human  
##  5 Beru~    165    75 blue            47   female Tatooine  Human  
##  6 Bigg~    183    84 brown           24   male   Tatooine  Human  
##  7 Obi-~    182    77 blue-gray       57   male   Stewjon   Human  
##  8 Anak~    188    84 blue            41.9 male   Tatooine  Human  
##  9 Wilh~    180    NA blue            64   male   Eriadu    Human  
## 10 Han ~    180    80 brown           29   male   Corellia  Human  
## # ... with 25 more rows, and 14 more variables:
## #   hair_color_auburn..grey <dbl>, hair_color_auburn..white <dbl>,
## #   hair_color_black <dbl>, hair_color_blond <dbl>,
## #   hair_color_brown <dbl>, hair_color_brown..grey <dbl>,
## #   hair_color_grey <dbl>, hair_color_none <dbl>, hair_color_white <dbl>,
## #   skin_color_fair <dbl>, skin_color_light <dbl>, skin_color_pale <dbl>,
## #   skin_color_tan <dbl>, skin_color_white <dbl>

More steps

recipes comes with many helpful preprocessing steps. For example, step_interact generates columns with interaction terms, step_log performs a log transformation, and step_pca replaces numeric variables with their principal component(s). Here is the complete list of steps in the version used for this post:

##  [1] "step_arrange"       "step_bagimpute"     "step_bin2factor"   
##  [4] "step_BoxCox"        "step_bs"            "step_center"       
##  [7] "step_classdist"     "step_corr"          "step_count"        
## [10] "step_date"          "step_depth"         "step_discretize"   
## [13] "step_downsample"    "step_dummy"         "step_factor2string"
## [16] "step_filter"        "step_geodist"       "step_holiday"      
## [19] "step_hyperbolic"    "step_ica"           "step_integer"      
## [22] "step_interact"      "step_intercept"     "step_inverse"      
## [25] "step_invlogit"      "step_isomap"        "step_knnimpute"    
## [28] "step_kpca"          "step_lag"           "step_lincomb"      
## [31] "step_log"           "step_logit"         "step_lowerimpute"  
## [34] "step_meanimpute"    "step_medianimpute"  "step_modeimpute"   
## [37] "step_mutate"        "step_naomit"        "step_nnmf"         
## [40] "step_novel"         "step_ns"            "step_num2factor"   
## [43] "step_nzv"           "step_ordinalscore"  "step_other"        
## [46] "step_pca"           "step_pls"           "step_poly"         
## [49] "step_profile"       "step_range"         "step_ratio"        
## [52] "step_regex"         "step_relu"          "step_rm"           
## [55] "step_rollimpute"    "step_sample"        "step_scale"        
## [58] "step_shuffle"       "step_slice"         "step_spatialsign"  
## [61] "step_sqrt"          "step_string2factor" "step_unorder"      
## [64] "step_upsample"      "step_window"        "step_YeoJohnson"   
## [67] "step_zv"
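As a rough illustration of two of the steps mentioned above, the sketch below chains step_log and step_interact on a made-up data frame (toy and its values are hypothetical). The last line shows one way a list like the one above could be produced; I am assuming that is close to what generated it, and the exact contents will depend on the installed version of recipes.

```r
library(recipes)

# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  height     = c(172, 202, 150, 178),
  mass       = c(77, 136, 49, 120),
  birth_year = c(19, 41.9, 19, 52)
)

toy_juiced <- recipe(~ ., data = toy) %>%
  step_log(mass) %>%                              # replaces mass with its natural log
  step_interact(terms = ~ height:birth_year) %>%  # adds a height x birth_year column
  prep() %>%
  juice()

names(toy_juiced)

# List every exported function in recipes whose name starts with "step_"
step_names <- grep("^step_", ls("package:recipes"), value = TRUE)
```

Note that the order of steps matters: a step placed after step_log would see the logged values, not the original ones.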

Further resources

Itamar Caspi
Economist

DISCLAIMER: This website and its content do not reflect the views of the Bank of Israel.