In my last post I discussed using coefplot on glmnet models and in particular discussed a brand new function, coefpath, that uses dygraphs to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of coefplot is the ability to show coefficient plots from xgboost models. Beyond fitting boosted trees and boosted forests, xgboost can also fit a boosted Elastic Net. This makes it a nice alternative to glmnet even though it might not have some of the same user niceties.

To illustrate, we use the same data as our previous post.

First, we load the packages we need and note the version numbers.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('caret', 'coefplot', 'DT', 'xgboost')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'dygraphs', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)
Package Version
caret 6.0.78
coefplot 1.2.6
dplyr 0.7.4
DT 0.2
dygraphs 1.1.1.4
knitr 1.18
magrittr 1.5
purrr 0.2.4
tibble 1.4.2
useful 1.2.3
xgboost 0.6.4

Then, we read the data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world. We also get validation data which is helpful when fitting xgboost mdoels.

manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))
manVal <- readRDS(url('http://www.jaredlander.com/data/manhattan_Validate.rds'))

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the data to display the actual data.

datatable(manTrain %>% dplyr::sample_n(size=1000), elementId='TrainingSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE
              ))

While glmnet automatically standardizes the input data, xgboost does not, so we calculate that manually. We use preprocess from caret to compute the mean and standard deviation of each numeric column then use these later.

preProc <- preProcess(manTrain, method=c('center', 'scale'))

Just like with glmnet, we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with xgboost we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark
manX <- useful::build.x(valueFormula, data=predict(preProc, manTrain),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

manX_val <- useful::build.x(valueFormula, data=predict(preProc, manVal),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY_val <- useful::build.y(valueFormula, data=manVal)

There are two functions we can use to fit xgboost models, the eponymous xgboost and xgb.train. When using xgb.train we first store our X and Y matrices in a special xgb.DMatrix object. This is not a necessary step, but makes things a bit cleaner.

manXG <- xgb.DMatrix(data=manX, label=manY)
manXG_val <- xgb.DMatrix(data=manX_val, label=manY_val)

We are now ready to fit a model. All we need to do to fit a linear model instead of a tree is set booster='gblinear' and objective='reg:linear'.

mod1 <- xgb.train(
    # the X and Y training data
    data=manXG,
    # use a linear model
    booster='gblinear',
    # minimize the a regression criterion 
    objective='reg:linear',
    # use MAE as a measure of quality
    eval_metric=c('mae'),
    # boost for up to 500 rounds
    nrounds=500,
    # print out the eval_metric for both the train and validation data
    watchlist=list(train=manXG, validate=manXG_val),
    # print eval_metric every 10 rounds
    print_every_n=10,
    # if the validate eval_metric hasn't improved by this many rounds, stop early
    early_stopping_rounds=25,
    # penalty terms for the L2 portion of the Elastic Net
    lambda=10, lambda_bias=10,
    # penalty term for the L1 portion of the Elastic Net
    alpha=900000000,
    # randomly sample rows
    subsample=0.8,
    # randomly sample columns
    col_subsample=0.7,
    # set the learning rate for gradient descent
    eta=0.1
)
## [1]  train-mae:1190145.875000    validate-mae:1433464.750000 
## Multiple eval metrics are present. Will use validate_mae for early stopping.
## Will train until validate_mae hasn't improved in 25 rounds.
## 
## [11] train-mae:938069.937500 validate-mae:1257632.000000 
## [21] train-mae:932016.625000 validate-mae:1113554.625000 
## [31] train-mae:931483.500000 validate-mae:1062618.250000 
## [41] train-mae:931146.750000 validate-mae:1054833.625000 
## [51] train-mae:930707.312500 validate-mae:1062881.375000 
## [61] train-mae:930137.375000 validate-mae:1077038.875000 
## Stopping. Best iteration:
## [41] train-mae:931146.750000 validate-mae:1054833.625000

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using dygraphs.

dygraphs::dygraph(mod1$evaluation_log)

We can now plot the coefficients using coefplot. Since xgboost does not save column names, we specify it with feature_names=colnames(manX). Unlike with glmnet models, there is only one penalty so we do not need to specify a specific penalty to plot.

coefplot(mod1, feature_names=colnames(manX), sort='magnitude')

This is another nice addition to coefplot utilizing the power of xgboost.

Related Posts



Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation, glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.

As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learn it is pronounced glimnet.

coefplot has long supported glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the htmlwidgets in iframes.

First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('coefplot', 'DT', 'glmnet')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')

The versions can then be displayed in a table.

versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)
Package Version
coefplot 1.2.5.1
dplyr 0.7.4
DT 0.2
glmnet 2.0.13
knitr 1.18
magrittr 1.5
purrr 0.2.4
tibble 1.4.1
useful 1.2.3

First, we read some data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.

manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))

The data are about New York City land value and have many columns. A sample of the data follows.

datatable(manTrain %>% dplyr::sample_n(size=100), elementId='DataSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE,
                  scrollY=300
              ))

In order to use glmnet we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with glmnet we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix ad vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark - 1

Notice the - 1 means do not include an intercept since glmnet will do that for us.

manX <- useful::build.x(valueFormula, data=manTrain,
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

We are now ready to fit a model.

mod1 <- glmnet(x=manX, y=manY, family='gaussian')

We can view a coefficient plot for a given value of lambda like this.

coefplot(mod1, lambda=330500, sort='magnitude')

A common plot that is built into the glmnet package it the coefficient path.

plot(mod1, xvar='lambda', label=TRUE)

This plot shows the path the coefficients take as lambda increases. They greater lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.

Fortunately, coefplot has a new function in Version 1.2.5 called coefpath for making this into an interactive plot using dygraphs.

coefpath(mod1)

While still busy this function provides so much more functionality. We can hover over lines, zoom in then pan around.

These functions also work with any value for alpha and for cross-validated models fit with cv.glmnet.

mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)

We plot coefficient plots for both optimal lambdas.

# coefplot for the 1se error lambda
coefplot(mod2, lambda='lambda.1se', sort='magnitude')

# coefplot for the min error lambda
coefplot(mod2, lambda='lambda.min', sort='magnitude')

The coefficient path is the same as before though the optimal lambdas are noted as dashed vertical lines.

coefpath(mod2)

While coefplot has long been able to plot coefficients from glmnet models, the new coefpath function goes a long way in helping visualize the paths the coefficients take as lambda changes.

Related Posts



Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

This is the code for a webinar I gave with Dan Mbanga for Amazon’s AWS Webinar Series about deep learning using MXNet in R. The video is available on YouTube.

I am experimentally saving htmlwidget objects to HTML files then loading them with iframes so please excuse the scroll bars.

Packages

These are the packages we are using either by loading the entire package or using individual functions.

Data

This dataset is about property lots in Manhattan and includes descriptive information as well as value. The original data are available from NYC Planning and the prepared files seen here at the Lander Analytics data.world repo.

data_train <- readr::read_csv(file.path(dataDir, 'manhattan_Train.csv'))
data_validate <- readr::read_csv(file.path(dataDir, 'manhattan_Validate.csv'))
data_test <- readr::read_csv(file.path(dataDir, 'manhattan_Test.csv'))
# Remove some variables
data_train <- data_train %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_validate <- data_validate %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_test <- data_test %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)

This is a glimpse of the data.

Here is a visualization of the class balance.

dataList <- list(data_train, data_validate, data_test)
dataList %>% 
    purrr::map(function(x) figure(width=600/NROW(dataList), height=500, legend_location=NULL) %>% 
                   ly_bar(x=High, color=factor(High), data=x, legend='code')
    ) %>% 
    grid_plot(nrow=1, ncol=NROW(dataList), same_axes=TRUE)

We use vtreat to do some automated feature engineering.

# The column name for the response
responseName <- 'High'
# The target value for the response
responseTarget <- TRUE
# The remaining columns are predictors
varNames <- setdiff(names(data_train), responseName)

# build the treatment design
treatmentDesign <- designTreatmentsC(dframe=data_train, varlist=varNames, 
                                     outcomename=responseName, 
                                     outcometarget=responseTarget, 
                                     verbose=TRUE)
## [1] "desigining treatments Mon Jun 26 01:23:38 2017"
## [1] "designing treatments Mon Jun 26 01:23:38 2017"
## [1] " have level statistics Mon Jun 26 01:23:39 2017"
## [1] "design var FireService Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist1 Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist2 Mon Jun 26 01:23:40 2017"
## [1] "design var ZoneDist3 Mon Jun 26 01:23:40 2017"
## [1] "design var Class Mon Jun 26 01:23:41 2017"
## [1] "design var LandUse Mon Jun 26 01:23:42 2017"
## [1] "design var Easements Mon Jun 26 01:23:43 2017"
## [1] "design var OwnerType Mon Jun 26 01:23:43 2017"
## [1] "design var LotArea Mon Jun 26 01:23:43 2017"
## [1] "design var BldgArea Mon Jun 26 01:23:43 2017"
## [1] "design var ComArea Mon Jun 26 01:23:43 2017"
## [1] "design var ResArea Mon Jun 26 01:23:44 2017"
## [1] "design var OfficeArea Mon Jun 26 01:23:44 2017"
## [1] "design var RetailArea Mon Jun 26 01:23:44 2017"
## [1] "design var GarageArea Mon Jun 26 01:23:44 2017"
## [1] "design var StrgeArea Mon Jun 26 01:23:44 2017"
## [1] "design var FactryArea Mon Jun 26 01:23:44 2017"
## [1] "design var OtherArea Mon Jun 26 01:23:44 2017"
## [1] "design var NumBldgs Mon Jun 26 01:23:44 2017"
## [1] "design var NumFloors Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsRes Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsTotal Mon Jun 26 01:23:44 2017"
## [1] "design var LotFront Mon Jun 26 01:23:44 2017"
## [1] "design var LotDepth Mon Jun 26 01:23:44 2017"
## [1] "design var BldgFront Mon Jun 26 01:23:44 2017"
## [1] "design var BldgDepth Mon Jun 26 01:23:45 2017"
## [1] "design var Extension Mon Jun 26 01:23:45 2017"
## [1] "design var Proximity Mon Jun 26 01:23:45 2017"
## [1] "design var IrregularLot Mon Jun 26 01:23:45 2017"
## [1] "design var LotType Mon Jun 26 01:23:46 2017"
## [1] "design var BasementType Mon Jun 26 01:23:46 2017"
## [1] "design var Landmark Mon Jun 26 01:23:47 2017"
## [1] "design var BuiltFAR Mon Jun 26 01:23:47 2017"
## [1] "design var ResidFAR Mon Jun 26 01:23:47 2017"
## [1] "design var CommFAR Mon Jun 26 01:23:47 2017"
## [1] "design var FacilFAR Mon Jun 26 01:23:47 2017"
## [1] "design var Built Mon Jun 26 01:23:47 2017"
## [1] "design var HistoricDistrict Mon Jun 26 01:23:48 2017"
## [1] " scoring treatments Mon Jun 26 01:23:48 2017"
## [1] "have treatment plan Mon Jun 26 01:24:19 2017"
## [1] "rescoring complex variables Mon Jun 26 01:24:19 2017"
## [1] "done rescoring complex variables Mon Jun 26 01:24:32 2017"

Then we create train, validate and test matrices.

# build design data.frames
dataTrain <- prepare(treatmentplan=treatmentDesign, dframe=data_train)
dataValidate <- prepare(treatmentplan=treatmentDesign, dframe=data_validate)
dataTest <- prepare(treatmentplan=treatmentDesign, dframe=data_test)

# use all the level names as predictors
predictorNames <- setdiff(names(dataTrain), responseName)

# training matrices
trainX <- data.matrix(dataTrain[, predictorNames])
trainY <- dataTrain[, responseName]

# validation matrices
validateX <- data.matrix(dataValidate[, predictorNames])
validateY <- dataValidate[, responseName]

# test matrices
testX <- data.matrix(dataTest[, predictorNames])
testY <- dataTest[, responseName]

# Sparse versions for some models
trainX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTrain)
validateX_sparse <- sparse.model.matrix(object=High ~ ., data=dataValidate)
testX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTest)

Feedforward Network

Helper Functions

This is a function that allows mxnet to calculate log-loss based on the logloss function from the Metrics package.

# log-loss
mx.metric.mlogloss <- mx.metric.custom("mlogloss", function(label, pred){
    return(Metrics::logLoss(label, pred))
})

Network Formulation

We build the model symbolically. We use a feedforward network with two hidden layers. The first hidden layer has 256 units and the second has 128 units. We also use dropout and batch normalization for regularization. The last step is to use a logistic sigmoid (inverse logit) for the logistic regression output.

net <- mx.symbol.Variable('data') %>%
    # drop out 20% of predictors
    mx.symbol.Dropout(p=0.2, name='Predictor_Dropout') %>%
    # a fully connected layer with 256 units
    mx.symbol.FullyConnected(num_hidden=256, name='fc_1') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_1') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_1') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_1') %>%
    # a fully connected layer with 128 units
    mx.symbol.FullyConnected(num_hidden=128, name='fc_2') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_2') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_2') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_2') %>%
    # fully connect to the output layer which has just the 1 unit
    mx.symbol.FullyConnected(num_hidden=1, name='out') %>%
    # use the sigmoid output
    mx.symbol.LogisticRegressionOutput(name='output')

Inspect the Network

By inspecting the symbolic network we see that it is actually just a C++ pointer. We also see its arguments and a visualization.

net
## C++ object <0000000018ed9aa0> of class 'MXSymbol' <0000000018c30d00>
arguments(net)
##  [1] "data"         "fc_1_weight"  "fc_1_bias"    "bn_1_gamma"  
##  [5] "bn_1_beta"    "fc_2_weight"  "fc_2_bias"    "bn_2_gamma"  
##  [9] "bn_2_beta"    "out_weight"   "out_bias"     "output_label"
graph.viz(net)

Network Training

With the data prepared and the network specified we now train the model. First we set the envinronment variable MXNET_CPU_WORKER_NTHREADS=4 since this demo is on a laptop with four threads. Using a GPU will speed up the computations. We also set the random seed with mx.set.seed for reproducibility.

We use the Adam optimization algorithm which has an adaptive learning rate which incorporates momentum.

# use four CPU threads
Sys.setenv('MXNET_CPU_WORKER_NTHREADS'=4)

# set the random seed
mx.set.seed(1234)

# train the model
mod_net <- mx.model.FeedForward.create(
    symbol            = net,    # the symbolic network
    X                 = trainX, # the predictors
    y                 = trainY, # the response
    optimizer         = "adam", # using the Adam optimization method
    eval.data         = list(data=validateX, label=validateY), # validation data
    ctx               = mx.cpu(), # use the cpu for training
    eval.metric       = mx.metric.mlogloss, # evaluate with log-loss
    num.round         = 50,     # 50 epochs
    learning.rate     = 0.001,   # learning rate
    array.batch.size  = 256,    # batch size
    array.layout      = "rowmajor"  # the data is stored in row major format
)

Predictions

Statisticians call this step prediction while the deep learning field calls it inference which has an entirely different meaning in statistics.

preds_net <- predict(mod_net, testX, array.layout="rowmajor") %>% t

Elastic Net

Model Training

We fit an Elastic Net model with glmnet.

registerDoParallel(cl=4)

set.seed(1234)
mod_glmnet <- cv.glmnet(x=trainX_sparse, y=trainY, 
                        alpha=.5, family='binomial', 
                        type.measure='auc',
                        nfolds=5, parallel=TRUE)

Predictions

preds_glmnet <- predict(mod_glmnet, newx=testX_sparse, s='lambda.1se', type='response')

XGBoost

Model Training

We fit a random forest with xgboost.

set.seed(1234)

trainXG <- xgb.DMatrix(data=trainX_sparse, label=trainY)
validateXG <- xgb.DMatrix(data=validateX_sparse, label=validateY)

watchlist <- list(train=trainXG, validate=validateXG)

mod_xgboost <- xgb.train(data=trainXG, 
                nrounds=1, nthread=4, 
                num_parallel_tree=500, subsample=0.5, colsample_bytree=0.5,
                objective='binary:logistic',
                eval_metric = "error", eval_metric = "logloss",
                print_every_n=1, watchlist=watchlist)
## [1]  train-error:0.104713    train-logloss:0.525032  validate-error:0.108254 validate-logloss:0.527535

Predictions

preds_xgboost <- predict(mod_xgboost, newdata=testX_sparse)

SVM

Model Training

set.seed(1234)
mod_svm <- e1071::svm(x=trainX_sparse, y=trainY, probability=TRUE, type='C')

This model did not train in a reasonable time.

Results

ROC

rocData <- dplyr::bind_rows(
    cbind(data.frame(roc(testY, preds_glmnet, direction="<")[c('specificities', 'sensitivities')]), Model='glmnet'),
    cbind(data.frame(roc(testY, preds_xgboost, direction="<")[c('specificities', 'sensitivities')]), Model='xgboost'),
    cbind(data.frame(roc(testY, preds_net, direction="<")[c('specificities', 'sensitivities')]), Model='Net')
)
ggplotly(ggplot(rocData, aes(x=specificities, y=sensitivities)) + geom_line(aes(color=Model, group=Model)) + scale_x_reverse(), width=800, height=600)

Related Posts



Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.