In my last post I discussed using `coefplot`

on `glmnet`

models and in particular discussed a brand new function, `coefpath`

, that uses `dygraphs`

to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of `coefplot`

is the ability to show coefficient plots from `xgboost`

models. Beyond fitting boosted trees and boosted forests, `xgboost`

can also fit a boosted Elastic Net. This makes it a nice alternative to `glmnet`

even though it might not have some of the same user niceties.

To illustrate, we use the same data as our previous post.

First, we load the packages we need and note the version numbers.

```
# list the packages that we load
# alphabetically for reproducibility
packages <- c('caret', 'coefplot', 'DT', 'xgboost')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)
# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'dygraphs', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
```

```
versions <- c(packages, packagesColon) %>%
purrr::map(packageVersion) %>%
purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>%
dplyr::arrange(Package)
knitr::kable(packageDF)
```

Package | Version |
---|---|

caret | 6.0.78 |

coefplot | 1.2.6 |

dplyr | 0.7.4 |

DT | 0.2 |

dygraphs | 1.1.1.4 |

knitr | 1.18 |

magrittr | 1.5 |

purrr | 0.2.4 |

tibble | 1.4.2 |

useful | 1.2.3 |

xgboost | 0.6.4 |

Then, we read the data. The data are available at https://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world. We also get validation data which is helpful when fitting `xgboost`

mdoels.

```
manTrain <- readRDS(url('https://www.jaredlander.com/data/manhattan_Train.rds'))
manVal <- readRDS(url('https://www.jaredlander.com/data/manhattan_Validate.rds'))
```

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the data to display the actual data.

```
datatable(manTrain %>% dplyr::sample_n(size=1000), elementId='TrainingSampled',
rownames=FALSE,
extensions=c('FixedHeader', 'Scroller'),
options=list(
scroller=TRUE
))
```

While `glmnet`

automatically standardizes the input data, `xgboost`

does not, so we calculate that manually. We use `preprocess`

from `caret`

to compute the mean and standard deviation of each numeric column then use these later.

`preProc <- preProcess(manTrain, method=c('center', 'scale'))`

Just like with `glmnet`

, we need to convert our `tbl`

into an X (predictor) `matrix`

and a Y (response) `vector`

. Since we don’t have to worry about multicolinearity with `xgboost`

we do not want to drop the baselines of `factor`

s. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the `matrix`

and `vector`

we need a `formula`

. This could be built programmatically, but we can just build it ourselves. The response is `TotalValue`

.

```
valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal +
LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built +
Landmark
```

```
manX <- useful::build.x(valueFormula, data=predict(preProc, manTrain),
# do not drop the baselines of factors
contrasts=FALSE,
# use a sparse matrix
sparse=TRUE)
manY <- useful::build.y(valueFormula, data=manTrain)
manX_val <- useful::build.x(valueFormula, data=predict(preProc, manVal),
# do not drop the baselines of factors
contrasts=FALSE,
# use a sparse matrix
sparse=TRUE)
manY_val <- useful::build.y(valueFormula, data=manVal)
```

There are two functions we can use to fit `xgboost`

models, the eponymous `xgboost`

and `xgb.train`

. When using `xgb.train`

we first store our X and Y matrices in a special `xgb.DMatrix`

object. This is not a necessary step, but makes things a bit cleaner.

```
manXG <- xgb.DMatrix(data=manX, label=manY)
manXG_val <- xgb.DMatrix(data=manX_val, label=manY_val)
```

We are now ready to fit a model. All we need to do to fit a linear model instead of a tree is set `booster='gblinear'`

and `objective='reg:linear'`

.

```
mod1 <- xgb.train(
# the X and Y training data
data=manXG,
# use a linear model
booster='gblinear',
# minimize the a regression criterion
objective='reg:linear',
# use MAE as a measure of quality
eval_metric=c('mae'),
# boost for up to 500 rounds
nrounds=500,
# print out the eval_metric for both the train and validation data
watchlist=list(train=manXG, validate=manXG_val),
# print eval_metric every 10 rounds
print_every_n=10,
# if the validate eval_metric hasn't improved by this many rounds, stop early
early_stopping_rounds=25,
# penalty terms for the L2 portion of the Elastic Net
lambda=10, lambda_bias=10,
# penalty term for the L1 portion of the Elastic Net
alpha=900000000,
# randomly sample rows
subsample=0.8,
# randomly sample columns
col_subsample=0.7,
# set the learning rate for gradient descent
eta=0.1
)
```

```
## [1] train-mae:1190145.875000 validate-mae:1433464.750000
## Multiple eval metrics are present. Will use validate_mae for early stopping.
## Will train until validate_mae hasn't improved in 25 rounds.
##
## [11] train-mae:938069.937500 validate-mae:1257632.000000
## [21] train-mae:932016.625000 validate-mae:1113554.625000
## [31] train-mae:931483.500000 validate-mae:1062618.250000
## [41] train-mae:931146.750000 validate-mae:1054833.625000
## [51] train-mae:930707.312500 validate-mae:1062881.375000
## [61] train-mae:930137.375000 validate-mae:1077038.875000
## Stopping. Best iteration:
## [41] train-mae:931146.750000 validate-mae:1054833.625000
```

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using `dygraphs`

.

`dygraphs::dygraph(mod1$evaluation_log)`

We can now plot the coefficients using `coefplot`

. Since `xgboost`

does not save column names, we specify it with `feature_names=colnames(manX)`

. Unlike with `glmnet`

models, there is only one penalty so we do not need to specify a specific penalty to plot.

`coefplot(mod1, feature_names=colnames(manX), sort='magnitude')`

This is another nice addition to `coefplot`

utilizing the power of `xgboost`

.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

## Leave a Reply