---
title: "drivendata-water-pumps"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
options(width = 125)
library(tidyverse)
library(caret)
library(ggrepel)
library(ggspatial)
library(lubridate)
library(rnaturalearth)
library(recipes)
library(rvest)
library(skimr)
library(here)
```
## Predicting waterpump functionality in Tanzania
With its mission of social impact, [DrivenData](https://www.drivendata.org/#) puts a laudable spin on data science competitions. Coming from the public sector and the world of global health, I did not have to muster much motivation to try my hand at one of their competitions. So I entered [Pump It Up: Data Mining the Water Table](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/), a multiclass classification problem: predicting water pump functionality in Tanzania.
```{r map-wells, eval=FALSE, echo=FALSE}
wells <- train_dta %>%
left_join(train_labs, by = "id") %>%
filter(type == "train",
longitude > 0) %>% # 100s of waterpoints plotted in the Atlantic Ocean // addressed in preprocessing
select(id, region, latitude, longitude, status_group) %>%
sample_n(1000)
tza_base <- ne_states(iso_a2 = "TZ", returnclass = "sf")
ggplot() +
geom_sf(data = tza_base, fill = "antiquewhite") +
geom_jitter(data = wells, aes(x = longitude, y = latitude, fill = status_group),
size = 1.5, shape = 23, alpha = 0.6) +
geom_text_repel(data = select(region_refs, reg_capital, cap_lat, cap_lon),
aes(x = cap_lon, y = cap_lat, label = reg_capital),
size = 3, color = "black", fontface = "bold", min.segment.length = 0) +
scale_fill_manual(name = "", values = c("forestgreen", "goldenrod", "firebrick")) +
annotation_scale(location = "bl") +
theme_bw() +
theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed", size = 0.5),
legend.position = "bottom") +
labs(title = "Location of wells and function status",
subtitle = str_c("n=", nrow(wells)),
x = "longitude", y = "latitude")
```
<center>
![](data/map-wells.png)
</center>
As a side note, I found Max Kuhn and Kjell Johnson's [Feature Engineering and Selection](https://bookdown.org/max/FES/) to be very instructive and an excellent resource for exploring and implementing feature engineering methods.
### I. Preprocessing and feature engineering
As we will see shortly, this dataset is filled with lots of missing data, sparse factor levels, and imbalanced classes. In order to address issues like these, we need to survey the dataset then take systematic steps to preprocess some features (e.g. imputation; scaling and centering; transformations; encoding factors), drop features, or create new features altogether.
#### Inspect training and test sets
To start, let's load the provided datasets and assess the landscape using the excellent [{skimr}](https://ropensci.github.io/skimr/) package that prints a diagnostic of a given dataframe that includes summary statistics, degree of missingness, unique values, and even mini histograms!
```{r load-data, include=FALSE}
train_labs <- read_csv(here("data","training-set_labels.csv")) %>% mutate(status_group = factor(status_group))
train_dta <- read_csv(here("data","training-set_values.csv")) %>% mutate(type = "train")
```
```{r skim-data, message=FALSE, out.width="100%"}
skim(train_dta)
```
Alright, now we have a sense of the data and we can start sorting out what kind of preprocessing and feature engineering might be warranted. Let's start by inspecting the categorical variables: there are sets of very similar variables, e.g. `extraction_type`, `extraction_type_class`, and `extraction_type_group`. As you see below, the `extraction` variables provide finer degrees of specification moving from `*_class` to `*_type`. Retaining slight variations of the same feature may adversely affect the model's performance, so we could consider recoding or collapsing the factor levels and dropping the redundant variables altogether.
```{r}
train_dta %>%
select_at(vars(starts_with("extraction"))) %>%
map(unique)
```
Additionally, we can see from the `skim` that there are a handful of categorical variables with thousands of unique values, which may make sense for `subvillage` or `wpt_name` but is likely not so useful for `funder` or `installer`, where we are better off combining categories; we will deal with those variables shortly.
```{r, warning=FALSE}
train_dta %>%
select(funder, installer) %>%
map(~count(data.frame(x = .x), x) %>% arrange(desc(n)))
```
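To illustrate the kind of lumping applied later with `forcats::fct_lump()`, here is a minimal base-R sketch (the installer names are toy values, not real data) that keeps the top-n most frequent levels and pools the rest into an `other` bucket:

```r
# keep the n most frequent values of x; lump everything else into "other"
lump_top_n <- function(x, n, other = "other") {
  keep <- names(sort(table(x), decreasing = TRUE))[seq_len(min(n, length(unique(x))))]
  factor(ifelse(x %in% keep, x, other))
}
installers <- c("dwe", "dwe", "dwe", "gov", "gov", "danida", "hesawa")
table(lump_top_n(installers, n = 2))
```

The `fct_lump(n = 50)` call in the preprocessing chunk below does the same thing at scale, with sensible handling of ties.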
We also see there are quite a few variables with non-trivial amounts of missing data (`scheme_name`, `permit`, `public_meeting`) and, more importantly, several variables with a proliferation of `0` values where `0` was likely used in place of `NA`. To address this issue, we will convert the `0`s to `NA`s and treat those `NA` values as an explicit `unknown` factor level.
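As a minimal base-R sketch of that pattern (toy values; the pipeline later in this post uses `dplyr::na_if()` and `forcats::fct_explicit_na()` to do the same thing across columns):

```r
# 0 -> NA for a numeric column where 0 is clearly a sentinel
pop <- c(0, 250, 0, 1200)   # 0 here almost certainly means "not recorded"
pop[pop == 0] <- NA         # recode the sentinel 0s as missing

# NA -> explicit "unknown" level for a categorical column
funder <- factor(c("danida", NA, "gov"))
funder <- addNA(funder)                             # promote NA to a real level
levels(funder)[is.na(levels(funder))] <- "unknown"  # rename it
funder
```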
Lastly, there are nonsensical coordinates concentrated in the `Shinyanga` and `Mwanza` regions, so we plant a flag there to address through imputation later on. These waterpoint coordinates in the Atlantic Ocean only came to my attention when creating the map seen above. Yet another instance of "always visualize your data!"
```{r}
train_dta %>%
count(region, latitude, longitude) %>%
arrange(desc(n))
```
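The preprocessing chunk later in this post handles these bad coordinates with a `case_when()` fallback from district-level means to region-level means. A minimal base-R sketch of that fallback idea, with made-up longitudes:

```r
# replace a sentinel coordinate with the district mean if available,
# otherwise fall back to the region mean (toy numbers, not real data)
impute_coord <- function(x, dist_mean, reg_mean, sentinel = 0) {
  ifelse(x == sentinel & dist_mean != sentinel, dist_mean,
         ifelse(x == sentinel, reg_mean, x))
}
impute_coord(c(33.1, 0, 0),
             dist_mean = c(33.0, 32.9, 0),
             reg_mean  = c(33.2, 33.2, 33.2))
```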
#### Creating features from external data sources
Now that we have identified areas for preprocessing, we should start thinking of new features that can complement the existing features and hopefully lend more information to the model.
It seems reasonable that the physical remoteness of a waterpoint would be predictive, so let's try to engineer some features to that end. In particular, a well's proximity to the nearest large city could be a good proxy for accessibility and likelihood of maintenance. To do this, we grab each regional capital's coordinates and subsequently use `geosphere::distGeo()` to calculate the distance between the regional capital and the waterpoint, as well as `recipes::step_geodist()` to calculate the waterpoint's distance from the origin (0°, 0°) and from Dar es Salaam.
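As a sanity check on this distance feature, here is a base-R haversine sketch that approximates what `geosphere::distGeo()` computes (`distGeo()` itself uses an ellipsoidal model and expects each point as `c(lon, lat)`); the Dodoma and Dar es Salaam coordinates below are approximate:

```r
# great-circle distance in km on a spherical Earth (haversine formula)
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
# e.g. Dodoma (~35.74E, 6.17S) to Dar es Salaam (~39.28E, 6.80S): roughly 400 km
haversine_km(35.74, -6.17, 39.28, -6.80)
```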
```{r grab-regional, eval=FALSE, message=FALSE}
## grab names of Tanzania's regional capitals
region_caps <- "https://en.wikipedia.org/wiki/Regions_of_Tanzania" %>%
read_html() %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
html_table() %>%
as.data.frame() %>%
janitor::clean_names() %>%
mutate(capital = str_replace(capital, "\\-", " ") %>% str_remove("\\sCity"),
region = str_remove_all(region, "\\sRegion")) %>%
select(region, reg_capital = capital)
## grab regional capital coordinates [https://simplemaps.com/data/world-cities] and join to names
region_refs <- read_csv(here("data", "worldcities.csv")) %>%
filter(iso3 == "TZA", capital %in% c("admin","primary")) %>%
mutate(capital = str_remove_all(capital, "city")) %>%
select(reg_capital = city, cap_lat = lat, cap_lon = lng) %>%
right_join(region_caps, by = "reg_capital")
```
#### Cleaning, preprocessing, feature engineering and selection
The below code documents the various processes undertaken to prepare the input data. These steps include: imputing missing data; string manipulation; addressing `NA` values; harmonizing class types; encoding categorical variables; creating new features; and, dropping redundant variables.
```{r preprocess, eval=FALSE}
prep_df <- bind_rows(train_dta, submit_dta) %>%
# join regional capitals dataframe
left_join(region_refs, by = "region") %>%
# calculate mean lat-lon per district, region, and basin
group_by(region_code, district_code) %>%
mutate(dist_lon = mean(longitude, na.rm = TRUE),
dist_lat = mean(latitude, na.rm = TRUE),
dist_gps = mean(gps_height, na.rm = TRUE)) %>%
ungroup() %>%
group_by(region) %>%
mutate(reg_lon = mean(longitude, na.rm = TRUE),
reg_lat = mean(latitude, na.rm = TRUE),
reg_gps = mean(gps_height, na.rm = TRUE)) %>%
ungroup() %>%
group_by(basin) %>%
mutate(basin_lon = mean(longitude, na.rm = TRUE),
basin_lat = mean(latitude, na.rm = TRUE),
basin_gps = mean(gps_height, na.rm = TRUE)) %>%
ungroup() %>%
# impute district/region-level mean lat-lons where values are nonsensical
mutate(
longitude = case_when(
longitude == 0 & dist_lon != 0 ~ dist_lon,
longitude == 0 & reg_lon != 0 ~ reg_lon,
TRUE ~ longitude
),
latitude = case_when(
latitude == -0.00000002 & dist_lat != -0.00000002 ~ dist_lat,
latitude == -0.00000002 & reg_lat != -0.00000002 ~ reg_lat,
TRUE ~ latitude
),
gps_height = case_when(
gps_height == 0 & dist_gps != 0 ~ dist_gps,
gps_height == 0 & reg_gps != 0 ~ reg_gps,
gps_height == 0 & basin_gps != 0 ~ basin_gps,
TRUE ~ gps_height
)
) %>%
# clean up variable types, factor levels, deal with missingness
mutate_if(is.character, ~str_to_lower(.) %>% str_squish(.)) %>% # harmonize case and whitespace
mutate_at(vars(-id, -source), ~na_if(., 0)) %>% # 0 ==> <NA>
mutate_if(is.character, as.factor) %>% # convert <chr> to <fct>
mutate_if(is.factor, fct_explicit_na, na_level = "unknown") %>% # make <NA> explicit level
mutate(extraction_type = recode(extraction_type,
"cemo" = "other motorpump",
"climax" = "other motorpump",
"other - mkulima/shinyanga" = "other handpump",
"other - play pump" = "other handpump",
"walimi" = "other handpump",
"other - swn 81" = "swn",
"swn 80" = "swn",
"india mark ii" = "india mark",
"india mark iii" = "india mark"),
# retain top 50 most frequent levels and lump remainder into "other"
funder_cat = fct_infreq(funder) %>% fct_lump(n = 50, other_level = "other"),
installer_cat = fct_infreq(installer) %>% fct_lump(n = 50, other_level = "other"),
# encode date variables
date_recorded = ymd(date_recorded),
weekend = if_else(wday(date_recorded) %in% c(1, 7), 1, 0), # lubridate: 1 = Sunday, 7 = Saturday
days_abs = as.numeric(as_date("2014-01-01") - date_recorded),
day_of_year = yday(date_recorded),
month_recorded = month(date_recorded),
# create feature for rainy/dry seasons
season = case_when(
month_recorded %in% 1:2 ~ "dry_short",
month_recorded %in% 3:5 ~ "wet_long",
month_recorded %in% 6:10 ~ "dry_long",
month_recorded %in% 11:12 ~ "wet_short"
),
# add feature for years in operation
operation_years = lubridate::year(date_recorded) - construction_year,
operation_years = if_else(operation_years < 0, "unknown", as.character(operation_years))
) %>%
# add feature for distance to regional capital
rowwise() %>%
mutate(dist_from_cap = geosphere::distGeo(c(longitude, latitude), c(cap_lon, cap_lat))) %>% # distGeo() expects c(lon, lat)
ungroup() %>%
# drop extraneous variables
select(
-recorded_by,
-wpt_name, -subvillage, -lga, -ward, -scheme_name,
-extraction_type, -extraction_type_group, -payment,
-quality_group, -quantity_group, -source_type, -waterpoint_type_group,
-funder, -installer, -num_private,
-public_meeting, -permit, -region_code, -district_code,
-reg_capital, -cap_lat, -cap_lon,
-dist_lon, -dist_lat, -dist_gps,
-reg_lon, -reg_lat, -reg_gps,
-basin_lat, -basin_lon, -basin_gps
)
```
### II. Model
#### Model recipe and final preprocessing
With the emergence of a suite of packages for modelling in the [`tidyverse`](https://github.com/tidymodels), I opted to employ the [recipes](https://tidymodels.github.io/recipes/) package to run the remaining, less bespoke preprocessing steps. In particular, I utilize `step_geodist` to calculate new features based on a waterpoint's distance to fixed points, `step_meanimpute` to impute any remaining `numeric` variables, `step_modeimpute` for any remaining `nominal` variables, `step_log` to transform highly skewed variables, and a final `check_missing` before modelling.
Although I had a fair amount of preprocessing to accomplish by hand, I can see the upside of creating an easy-to-digest blueprint for your feature engineering process using a consistent interface. Moreover, the available `step_*` [functions](https://tidymodels.github.io/recipes/reference/index.html) cover lots of ground, including imputation, transformation, normalization, discretization, dummy variables, interactions, correlations, and down/upsampling.
```{r recipe, eval=FALSE, message=FALSE}
rec_obj <- recipe( ~ ., train_df) %>%
step_geodist(name = "geo_dist_zero",
lat = latitude, lon = longitude, log = FALSE,
ref_lat = 0.0, ref_lon = 0.0) %>%
step_geodist(name = "geo_dist_dar",
lat = latitude, lon = longitude, log = FALSE,
ref_lat = -6.8, ref_lon = 39.2) %>%
step_meanimpute(all_numeric(), -id) %>%
step_modeimpute(all_nominal()) %>%
step_log(amount_tsh, population, dist_from_cap, days_abs) %>%
# step_dummy() encoding dummy variables showed no improvement in performance and increased training time
check_missing(all_predictors()) %>%
prep(verbose = TRUE)
train_dta_preproc <- bake(rec_obj, train_df) %>%
left_join(., train_labs, by = "id") %>%
select(-id)
```
<br>
> ##### Sidenote: Plug in, iterate faster, learn more!
> If you are like me, you are building out your ML skills and experience on your own time and hardware. These prove to be non-trivial constraints when working with more data or trying to iterate through ML approaches in any efficient, timely manner. So, after learning the hard way, I worked out that you don't have to have expensive, powerful hardware at hand - Amazon has plenty to rent, often at quite reasonable rates.
> I strongly suggest that you sign up for [AWS](https://aws.amazon.com/) and set up an AWS EC2 instance. Fortunately, [Louis Aslett](http://www.louisaslett.com/RStudio_AMI/) maintains an RStudio image with all kinds of useful dependencies built in (e.g. db drivers, Java, GDAL, Stan, Keras) so you can get up and running in no time. [Here](http://strimas.com/r/rstudio-cloud-1/) is a simple walk-through for getting this all set up.
#### Random Forest
`{ranger}` proved to be the best performing model of my early efforts after experimenting with `glmnet`, `knn`, `svmRadial`, and `nb`. Through trial and error, I found that the below set of hyperparameters (`min.node.size`, `mtry`, `splitrule`) resulted in the best classification rate: **`0.8204`**. I should note that this early effort was done on a laptop and it took nearly 6 hours to run.
```{r run-model, eval=FALSE}
library(doParallel) # provides makeCluster()/registerDoParallel()
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(24)
model_rng <- train(
status_group ~ .,
data = train_dta_preproc,
method = "ranger",
metric = "Accuracy",
importance = "impurity",
tuneGrid = expand.grid(.min.node.size = 1:3, .mtry = c(24,33,46), .splitrule = c("extratrees","gini")),
trControl = trainControl(method = "cv", number = 5)
)
stopCluster(cl)
```
#### H2O AutoML
It was clear that I was being constrained by hardware and training time, so I set up an AWS EC2 instance (described above) and, having seen [H2O](http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/index.html) utilized more and more, decided to give it a shot. Moreover, I was eager to find a more efficient way to cycle through various ML algorithms, grids, and ensembling approaches. Lucky for me, H2O recently released its [AutoML](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#) interface, which trains and cross-validates a series of XGBoost, GLM, Random Forest, and Deep Neural Net models; random grids of XGBoost GBMs, H2O GBMs, and Deep Neural Nets; and two stacked ensemble models. How could you pass that up, especially when it can be implemented in just a few lines of code?
```{r run-h2o, eval=FALSE}
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "<avail_mem_size_here>") # set according to available memory
# convert to h2o objects
train_dta_preproc_aml <- as.h2o(train_dta_preproc)
submit_dta_preproc_aml <- as.h2o(submit_dta_preproc)
# set model vars
response <- "status_group"
predictors <- setdiff(names(train_dta_preproc_aml), "status_group")
# run AutoML
model_aml <- h2o.automl(x = predictors, y = response,
training_frame = train_dta_preproc_aml,
seed = 2,
nfolds = 10,
balance_classes = TRUE,
max_runtime_secs = <time_in_secs>) # adjust maximum training time here
```
#### Performance
As you can see below and as I had hoped, H2O AutoML found a better performing model that increased my score to **`0.8239`** and put me in the **top 2%** of competitors. Ultimately, out of ~100 models, `XGBoost` outperformed other GBMs, random forests, neural nets, and stacked models.
<center>
![](data/drivendata-rank.png)
</center>
Let's have a look at the confusion matrix and variable importance to get a sense of where the model performs well and what variables are contributing the most:
```{r h2o_setup, include=FALSE}
library(h2o)
h2o.init()
model_aml <- h2o.loadModel(here("data","model_aml"))
```
```{r h2o_diag}
h2o.confusionMatrix(model_aml)
h2o.varimp(model_aml) %>% head(n = 20)
```
We can see from the confusion matrix that the model has a difficult time classifying `functional needs repair`, with an error rate of more than **25%**. With regard to variable importance, it is good to see that **4** of the top **5** most important features are related to the lat-lon coordinates, where we spent time preprocessing, feature engineering, and pulling in external datasets. Additionally, all of the time-related features we created are in the top 20 as well.
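For reference, the per-class error reported by `h2o.confusionMatrix()` is just the misclassification rate within each true class, i.e. one minus the diagonal over the row total. A base-R sketch with made-up counts (not the actual model's results):

```r
# toy confusion matrix: rows = actual class, columns = predicted class
cm <- matrix(c(900,  40,  60,
               120,  50,  30,
                80,  20, 700),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("functional", "needs repair", "non functional"),
                             predicted = c("functional", "needs repair", "non functional")))
# per-class error: 1 - correct / row total
round(1 - diag(cm) / rowSums(cm), 3)
```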
That being said, if I were to continue tweaking the model, I would lean on the coordinates data some more, possibly by creating new features from this World Bank GIS dataset of [Tanzanian roads](https://datacatalog.worldbank.org/dataset/tanzania-roads). Given the proliferation of predictors with many levels in this dataset, I am also keen to try out [`FeatureHashing`](https://github.com/wush978/FeatureHashing) as described [here](https://bookdown.org/max/FES/encoding-predictors-with-many-categories.html). Lastly, I would dig deeper into the cases and characteristics of `functional needs repair`, as it proved to be the most stubborn class to predict.