I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation, glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.

As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learned it is pronounced glimnet.

coefplot has long supported glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the htmlwidgets in iframes.

First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('coefplot', 'DT', 'glmnet')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')

The versions can then be displayed in a table.

versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)
Package Version
coefplot 1.2.5.1
dplyr 0.7.4
DT 0.2
glmnet 2.0.13
knitr 1.18
magrittr 1.5
purrr 0.2.4
tibble 1.4.1
useful 1.2.3

Next, we read some data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.

manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))

The data are about New York City land value and have many columns. A sample of the data follows.

datatable(manTrain %>% dplyr::sample_n(size=100), elementId='DataSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE,
                  scrollY=300
              ))

In order to use glmnet we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicollinearity with glmnet we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark - 1

Notice the - 1 means do not include an intercept since glmnet will do that for us.
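For anyone who would rather build the formula programmatically, base R's reformulate is one option. This is only a sketch: the predictors vector is a shortened, hand-typed stand-in for the full list above, and intercept=FALSE plays the role of the - 1.

```r
# a shortened stand-in for the full set of predictors (for illustration only)
predictors <- c('FireService', 'ZoneDist1', 'LotArea', 'BldgArea')

# intercept=FALSE removes the intercept, just like writing "- 1" by hand
sketchFormula <- reformulate(predictors, response='TotalValue', intercept=FALSE)
sketchFormula
```

The predictor names could just as easily come from names() on the data, minus the response and any columns we do not want.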

manX <- useful::build.x(valueFormula, data=manTrain,
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)
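To see what keeping the baselines means, here is a small base R sketch on made-up data. It uses model.matrix directly, which is an assumption for illustration and not necessarily what build.x does internally. With only - 1 the first factor keeps all of its levels but the second still loses its baseline, while supplying full indicator contrasts keeps every level of every factor.

```r
# toy data, invented for illustration
toy <- data.frame(
    y=c(1, 2, 3, 4, 5, 6),
    g=factor(c('a', 'b', 'c', 'a', 'b', 'c')),
    h=factor(c('u', 'u', 'v', 'v', 'u', 'v'))
)

# no intercept, default contrasts: h still drops its baseline level 'u'
dropped <- model.matrix(y ~ g + h - 1, data=toy)
colnames(dropped)

# no intercept, full indicator columns for every level of both factors
kept <- model.matrix(
    y ~ g + h - 1, data=toy,
    contrasts.arg=list(
        g=contrasts(toy$g, contrasts=FALSE),
        h=contrasts(toy$h, contrasts=FALSE)
    )
)
colnames(kept)
```

The second matrix is rank deficient, which would trouble ordinary least squares but is fine for a penalized model.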

We are now ready to fit a model.

mod1 <- glmnet(x=manX, y=manY, family='gaussian')

We can view a coefficient plot for a given value of lambda like this.

coefplot(mod1, lambda=330500, sort='magnitude')

A common plot that is built into the glmnet package is the coefficient path.

plot(mod1, xvar='lambda', label=TRUE)

This plot shows the path the coefficients take as lambda increases. The greater lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.
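The shrinkage itself can be illustrated with a tiny base R sketch. This is not how glmnet computes its path; it is the textbook special case of a pure lasso penalty with orthonormal predictors, where each coefficient is simply soft-thresholded. The starting coefficients and lambda values are made up.

```r
# soft-thresholding: pull a coefficient toward zero by lambda, stopping at zero
softThreshold <- function(beta, lambda) {
    sign(beta) * pmax(abs(beta) - lambda, 0)
}

# hypothetical unpenalized coefficients
olsCoefs <- c(big=3, medium=-1.5, small=0.4)
# a grid of penalties, from none to heavy
lambdas <- c(0, 0.5, 1, 2, 4)

# one column per lambda, one row per coefficient
path <- sapply(lambdas, softThreshold, beta=olsCoefs)
colnames(path) <- paste0('lambda=', lambdas)
path
```

Reading across a row traces one line of a coefficient path: the small coefficient hits zero first and the big one last.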

Fortunately, coefplot has a new function in Version 1.2.5 called coefpath for making this into an interactive plot using dygraphs.

coefpath(mod1)

While still busy, this function provides so much more functionality. We can hover over lines, zoom in and then pan around.

These functions also work with any value for alpha and for cross-validated models fit with cv.glmnet.

mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)
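As a reminder of what alpha controls, here is a sketch of the Elastic Net penalty as glmnet parameterizes it: alpha=1 is the lasso, alpha=0 is ridge and anything in between is a blend. The function name and the coefficient values are invented for illustration.

```r
# the Elastic Net penalty: lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta)))
elasticNetPenalty <- function(beta, lambda, alpha) {
    lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

# hypothetical coefficients
beta <- c(1, -2)

elasticNetPenalty(beta, lambda=1, alpha=1)    # pure lasso
elasticNetPenalty(beta, lambda=1, alpha=0)    # pure ridge
elasticNetPenalty(beta, lambda=1, alpha=0.7)  # the blend used for mod2
```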

We plot coefficient plots for both optimal lambdas.

# coefplot for the 1se error lambda
coefplot(mod2, lambda='lambda.1se', sort='magnitude')

# coefplot for the min error lambda
coefplot(mod2, lambda='lambda.min', sort='magnitude')
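For reference, the two lambdas are chosen from the cross-validation curve: lambda.min minimizes the mean cross-validated error and lambda.1se is the largest lambda whose error is within one standard error of that minimum. The sketch below mimics that rule on invented numbers; the vectors are stand-ins for the lambda, cvm and cvsd elements of a cv.glmnet object.

```r
# hypothetical cross-validation results, largest lambda first
lambda <- c(100, 50, 25, 12, 6, 3)
cvm    <- c(9.0, 6.0, 4.6, 4.2, 4.0, 4.1)  # mean cross-validated error
cvsd   <- c(0.6, 0.5, 0.5, 0.4, 0.5, 0.5)  # standard error of that mean

# the lambda with the smallest mean error
idxMin <- which.min(cvm)
lambdaMin <- lambda[idxMin]

# the largest lambda whose error is within one standard error of the minimum
threshold <- cvm[idxMin] + cvsd[idxMin]
lambda1se <- max(lambda[cvm <= threshold])

c(lambda.min=lambdaMin, lambda.1se=lambda1se)
```

Since lambda.1se is never smaller than lambda.min, its coefficient plot will generally show fewer non-zero coefficients.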

The coefficient path is the same as before though the optimal lambdas are noted as dashed vertical lines.

coefpath(mod2)

While coefplot has long been able to plot coefficients from glmnet models, the new coefpath function goes a long way in helping visualize the paths the coefficients take as lambda changes.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

The biggest event for me this year was completely outside of work and had nothing to do with statistics or R: I got married. We technically met at the Open Stats meetup and I did build our wedding website with RMarkdown, so R was still involved. We just returned from our around-the-world honeymoon so I thought the best way to track our travels would be with maps and globes using leaflet and threejs.

Before we get to any code, the following packages were used in making this post.

This was an extensive trip that, in addition to traditional vacation activities, included a few visits to clients and speaking at a few conferences and meetups. In all, we visited London, Singapore, Hong Kong, Auckland, Queenstown, Bora Bora, Tahiti, Moorea, San Jose and San Francisco, with a connection or two in between.

The airport/ferry codes for our trip were the following.

Origin Destination Airline
JFK LGW Norwegian Air
LHR SIN Singapore Airlines
SIN HKG Singapore Airlines
HKG AKL Cathay Pacific
AKL ZQN Air New Zealand
ZQN AKL Air New Zealand
AKL PPT Air New Zealand
PPT BOB Air Tahiti
BOB MOZ Air Tahiti
MOZ PPT Terevau
PPT LAX Air Tahiti Nui
LAX SJC Alaska Airlines
SFO JFK JetBlue

Converting these to latitude and longitude is easy thanks to Open Flights.

# read in the data
airports <- readr::read_csv('https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports-extended.dat',
                            # give it good column names since the data are headerless
                            col_names=c('ID', 'Name', 'City', 'Country', 
                                        'IATA', 'ICAO', 'Latitude', 'Longitude', 
                                        'Altitude', 'Timezone', 'DST', 'Tz', 
                                        'Type', 'Source'))

We then use filter to get just the ports we visited. Notice how we use a second tbl inside filter.

visited <- airports %>% 
    select(Name, City, Country, IATA, Latitude, Longitude) %>% 
    filter(IATA %in% (codes %>% select(Origin, Destination) %>% unlist))
DT::datatable(visited, elementId='AirportsTable',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  dom='<"top"f>rt<"bottom"i><"clear">'
                  ,
                  scrollY=200,
                  scroller=TRUE
              )
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

We then manually reorder the airports so that edges can be drawn nicely between them. This is akin to creating an edgelist of airport-pairs. This is not the most robust way of creating this list, but suffices for our purposes.

visitedOrdered <- visited %>% 
    slice(c(12, 1, 2, 8, 7, 5, 6, 5, 13, 3, 4, 13, 10, 11, 12))

For the first visualization let’s create a map using leaflet.

# initialize the widget
leaflet(data=visitedOrdered) %>% 
    # overlay map tiles
    addTiles() %>% 
    # plot lines from one point to another
    addPolylines(lng=~Longitude, lat=~Latitude) %>% 
    # add markers with city names
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~City)

Unfortunately, this doesn’t quite capture the directions of the flights as it makes it look like we flew back west to get to Papeete. So let’s try a globe instead using threejs.

We augment the edgelist of airports so that it has the latitude and longitude of the origin and destination airports for each flight.

flightPaths <- codes %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Origin'='IATA')) %>% 
    rename(oLong=Longitude, oLat=Latitude) %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Destination'='IATA')) %>% 
    rename(dLong=Longitude, dLat=Latitude)
# display the augmented edgelist rather than the airport table
DT::datatable(flightPaths, elementId='FlightPathLatLong',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  dom='<"top"f>rt<"bottom"i><"clear">',
                  scrollY=200,
                  scroller=TRUE
              )
) %>% 
    DT::formatRound(columns=c('oLong', 'oLat', 'dLong', 'dLat'), digits=2)
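The same lookup can be sketched in base R with match, which may help show what the two joins are doing. The coordinates below are rounded values typed by hand for a few of the airports, and the object names are made up so they do not clash with the real data.

```r
# approximate coordinates for a few airports, for illustration only
ports <- data.frame(
    IATA=c('JFK', 'LGW', 'LHR', 'SIN'),
    Longitude=c(-73.78, -0.19, -0.46, 103.99),
    Latitude=c(40.64, 51.15, 51.47, 1.36),
    stringsAsFactors=FALSE
)

# the first two legs of the trip
legs <- data.frame(
    Origin=c('JFK', 'LHR'),
    Destination=c('LGW', 'SIN'),
    stringsAsFactors=FALSE
)

# find the row of ports that matches each origin and each destination
o <- match(legs$Origin, ports$IATA)
d <- match(legs$Destination, ports$IATA)

# attach origin and destination coordinates to each leg
paths <- data.frame(
    legs,
    oLong=ports$Longitude[o], oLat=ports$Latitude[o],
    dLong=ports$Longitude[d], dLat=ports$Latitude[d]
)
paths
```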

Now we can provide that data to threejs. We first specify an image to overlay on the globe. Then we specify the latitude and longitude of visited airports. After that, we provide the origin and destination latitudes and longitudes of our flights. The rest of the arguments are cosmetic.

globejs(
    # the image to overlay on the globe
    img="http://eoimages.gsfc.nasa.gov/images/imagerecords/73000/73909/world.topo.bathy.200412.3x5400x2700.jpg",
    # lat/long of visited airports
    lat=visited$Latitude, long=visited$Longitude,
    # lat/long of origin and destination
    arcs=flightPaths %>% select(oLat, oLong, dLat, dLong),
    # cosmetic adjustments
    arcsHeight=.4, arcsLwd=7, arcsColor="red", arcsOpacity=.95,
    atmosphere=FALSE, fov=30, rotationlat=0.3, rotationlong=.8*pi)

We now calculate the total distance traveled (not including car trips) using Haversine Distance to account for the curvature of the Earth.

distHaversine(visitedOrdered %>% select(Longitude, Latitude), r=3959) %>% sum
## [1] 28660.52

So we traveled 3,760 more miles than the circumference of the Earth.
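For the curious, the Haversine formula is short enough to write out in base R. This is a sketch of the standard formula and not the geosphere source code, with the same radius of 3,959 miles. The comparison assumes the commonly quoted equatorial circumference of about 24,901 miles.

```r
# great-circle distance in miles between two points given in degrees
haversineMiles <- function(long1, lat1, long2, lat2, r=3959) {
    toRad <- pi / 180
    dLat <- (lat2 - lat1) * toRad
    dLong <- (long2 - long1) * toRad
    a <- sin(dLat / 2)^2 +
        cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLong / 2)^2
    # pmin guards against rounding pushing sqrt(a) slightly above 1
    2 * r * asin(pmin(1, sqrt(a)))
}

# sanity check: a quarter of the way around the equator is r * pi / 2
quarterTurn <- haversineMiles(0, 0, 90, 0)

# the comparison with the circumference of the Earth
earthCircumference <- 24901  # assumed equatorial circumference in miles
extraMiles <- 28660.52 - earthCircumference
extraMiles
```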

Beyond the epic proportions of our travel, this honeymoon was outstanding from the sheer length, to the vastly different places we visited, to the food we ate and the sights we saw, to the activities we participated in, to the great people along the way. And, of course, it’s amazing to spend a month traveling with your favorite person.

This is the code for a webinar I gave with Dan Mbanga for Amazon’s AWS Webinar Series about deep learning using MXNet in R. The video is available on YouTube.

I am experimentally saving htmlwidget objects to HTML files then loading them with iframes so please excuse the scroll bars.

Packages

These are the packages we are using either by loading the entire package or using individual functions.

Data

This dataset is about property lots in Manhattan and includes descriptive information as well as value. The original data are available from NYC Planning and the prepared files seen here at the Lander Analytics data.world repo.

data_train <- readr::read_csv(file.path(dataDir, 'manhattan_Train.csv'))
data_validate <- readr::read_csv(file.path(dataDir, 'manhattan_Validate.csv'))
data_test <- readr::read_csv(file.path(dataDir, 'manhattan_Test.csv'))
# Remove some variables
data_train <- data_train %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_validate <- data_validate %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_test <- data_test %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)

This is a glimpse of the data.

Here is a visualization of the class balance.

dataList <- list(data_train, data_validate, data_test)
dataList %>% 
    purrr::map(function(x) figure(width=600/NROW(dataList), height=500, legend_location=NULL) %>% 
                   ly_bar(x=High, color=factor(High), data=x, legend='code')
    ) %>% 
    grid_plot(nrow=1, ncol=NROW(dataList), same_axes=TRUE)

We use vtreat to do some automated feature engineering.

# The column name for the response
responseName <- 'High'
# The target value for the response
responseTarget <- TRUE
# The remaining columns are predictors
varNames <- setdiff(names(data_train), responseName)

# build the treatment design
treatmentDesign <- designTreatmentsC(dframe=data_train, varlist=varNames, 
                                     outcomename=responseName, 
                                     outcometarget=responseTarget, 
                                     verbose=TRUE)
## [1] "desigining treatments Mon Jun 26 01:23:38 2017"
## [1] "designing treatments Mon Jun 26 01:23:38 2017"
## [1] " have level statistics Mon Jun 26 01:23:39 2017"
## [1] "design var FireService Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist1 Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist2 Mon Jun 26 01:23:40 2017"
## [1] "design var ZoneDist3 Mon Jun 26 01:23:40 2017"
## [1] "design var Class Mon Jun 26 01:23:41 2017"
## [1] "design var LandUse Mon Jun 26 01:23:42 2017"
## [1] "design var Easements Mon Jun 26 01:23:43 2017"
## [1] "design var OwnerType Mon Jun 26 01:23:43 2017"
## [1] "design var LotArea Mon Jun 26 01:23:43 2017"
## [1] "design var BldgArea Mon Jun 26 01:23:43 2017"
## [1] "design var ComArea Mon Jun 26 01:23:43 2017"
## [1] "design var ResArea Mon Jun 26 01:23:44 2017"
## [1] "design var OfficeArea Mon Jun 26 01:23:44 2017"
## [1] "design var RetailArea Mon Jun 26 01:23:44 2017"
## [1] "design var GarageArea Mon Jun 26 01:23:44 2017"
## [1] "design var StrgeArea Mon Jun 26 01:23:44 2017"
## [1] "design var FactryArea Mon Jun 26 01:23:44 2017"
## [1] "design var OtherArea Mon Jun 26 01:23:44 2017"
## [1] "design var NumBldgs Mon Jun 26 01:23:44 2017"
## [1] "design var NumFloors Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsRes Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsTotal Mon Jun 26 01:23:44 2017"
## [1] "design var LotFront Mon Jun 26 01:23:44 2017"
## [1] "design var LotDepth Mon Jun 26 01:23:44 2017"
## [1] "design var BldgFront Mon Jun 26 01:23:44 2017"
## [1] "design var BldgDepth Mon Jun 26 01:23:45 2017"
## [1] "design var Extension Mon Jun 26 01:23:45 2017"
## [1] "design var Proximity Mon Jun 26 01:23:45 2017"
## [1] "design var IrregularLot Mon Jun 26 01:23:45 2017"
## [1] "design var LotType Mon Jun 26 01:23:46 2017"
## [1] "design var BasementType Mon Jun 26 01:23:46 2017"
## [1] "design var Landmark Mon Jun 26 01:23:47 2017"
## [1] "design var BuiltFAR Mon Jun 26 01:23:47 2017"
## [1] "design var ResidFAR Mon Jun 26 01:23:47 2017"
## [1] "design var CommFAR Mon Jun 26 01:23:47 2017"
## [1] "design var FacilFAR Mon Jun 26 01:23:47 2017"
## [1] "design var Built Mon Jun 26 01:23:47 2017"
## [1] "design var HistoricDistrict Mon Jun 26 01:23:48 2017"
## [1] " scoring treatments Mon Jun 26 01:23:48 2017"
## [1] "have treatment plan Mon Jun 26 01:24:19 2017"
## [1] "rescoring complex variables Mon Jun 26 01:24:19 2017"
## [1] "done rescoring complex variables Mon Jun 26 01:24:32 2017"

Then we create train, validate and test matrices.

# build design data.frames
dataTrain <- prepare(treatmentplan=treatmentDesign, dframe=data_train)
dataValidate <- prepare(treatmentplan=treatmentDesign, dframe=data_validate)
dataTest <- prepare(treatmentplan=treatmentDesign, dframe=data_test)

# use all the level names as predictors
predictorNames <- setdiff(names(dataTrain), responseName)

# training matrices
trainX <- data.matrix(dataTrain[, predictorNames])
trainY <- dataTrain[, responseName]

# validation matrices
validateX <- data.matrix(dataValidate[, predictorNames])
validateY <- dataValidate[, responseName]

# test matrices
testX <- data.matrix(dataTest[, predictorNames])
testY <- dataTest[, responseName]

# Sparse versions for some models
trainX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTrain)
validateX_sparse <- sparse.model.matrix(object=High ~ ., data=dataValidate)
testX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTest)

Feedforward Network

Helper Functions

This is a function that allows mxnet to calculate log-loss based on the logLoss function from the Metrics package.

# log-loss
mx.metric.mlogloss <- mx.metric.custom("mlogloss", function(label, pred){
    return(Metrics::logLoss(label, pred))
})
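For reference, binary log-loss is the negative average log-likelihood of the predicted probabilities. Below is a base R sketch of the formula; the function name is made up and the clipping by eps is an addition here to avoid log(0), not necessarily something Metrics does.

```r
# binary log-loss for 0/1 labels and predicted probabilities
logLossSketch <- function(actual, predicted, eps=1e-15) {
    # keep predictions strictly inside (0, 1) so the logs stay finite
    predicted <- pmin(pmax(predicted, eps), 1 - eps)
    -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

# a coin-flip prediction scores log(2)
logLossSketch(c(1, 0), c(0.5, 0.5))

# confident and correct predictions score lower than hesitant ones
logLossSketch(c(1, 0), c(0.9, 0.1))
logLossSketch(c(1, 0), c(0.6, 0.4))
```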

Network Formulation

We build the model symbolically. We use a feedforward network with two hidden layers. The first hidden layer has 256 units and the second has 128 units. We also use dropout and batch normalization for regularization. The last step is to use a logistic sigmoid (inverse logit) for the logistic regression output.

net <- mx.symbol.Variable('data') %>%
    # drop out 20% of predictors
    mx.symbol.Dropout(p=0.2, name='Predictor_Dropout') %>%
    # a fully connected layer with 256 units
    mx.symbol.FullyConnected(num_hidden=256, name='fc_1') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_1') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_1') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_1') %>%
    # a fully connected layer with 128 units
    mx.symbol.FullyConnected(num_hidden=128, name='fc_2') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_2') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_2') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_2') %>%
    # fully connect to the output layer which has just the 1 unit
    mx.symbol.FullyConnected(num_hidden=1, name='out') %>%
    # use the sigmoid output
    mx.symbol.LogisticRegressionOutput(name='output')
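To make the layer arithmetic concrete, here is a base R sketch of a forward pass through the same shape of network, only much smaller. It is a simplification under loud assumptions: the weights are random rather than trained, the layer sizes are tiny, and dropout and batch normalization are left out entirely.

```r
set.seed(1234)

# activation functions
relu <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))

# made-up data: 5 observations with 10 predictors
nObs <- 5
nPred <- 10
x <- matrix(rnorm(nObs * nPred), nrow=nObs)

# random weights and zero biases for two hidden layers (8 and 4 units) and the output
w1 <- matrix(rnorm(nPred * 8, sd=0.1), nrow=nPred)
b1 <- rep(0, 8)
w2 <- matrix(rnorm(8 * 4, sd=0.1), nrow=8)
b2 <- rep(0, 4)
wOut <- matrix(rnorm(4, sd=0.1), nrow=4)
bOut <- 0

# fully connected layer followed by relu, twice
h1 <- relu(sweep(x %*% w1, 2, b1, '+'))
h2 <- relu(sweep(h1 %*% w2, 2, b2, '+'))

# one output unit passed through the sigmoid gives a probability per observation
probs <- sigmoid(h2 %*% wOut + bOut)
probs
```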

Inspect the Network

By inspecting the symbolic network we see that it is actually just a C++ pointer. We also see its arguments and a visualization.

net
## C++ object <0000000018ed9aa0> of class 'MXSymbol' <0000000018c30d00>
arguments(net)
##  [1] "data"         "fc_1_weight"  "fc_1_bias"    "bn_1_gamma"  
##  [5] "bn_1_beta"    "fc_2_weight"  "fc_2_bias"    "bn_2_gamma"  
##  [9] "bn_2_beta"    "out_weight"   "out_bias"     "output_label"
graph.viz(net)

Network Training

With the data prepared and the network specified we now train the model. First we set the environment variable MXNET_CPU_WORKER_NTHREADS=4 since this demo is on a laptop with four threads. Using a GPU will speed up the computations. We also set the random seed with mx.set.seed for reproducibility.

We use the Adam optimization algorithm, which has an adaptive learning rate and incorporates momentum.

# use four CPU threads
Sys.setenv('MXNET_CPU_WORKER_NTHREADS'=4)

# set the random seed
mx.set.seed(1234)

# train the model
mod_net <- mx.model.FeedForward.create(
    symbol            = net,    # the symbolic network
    X                 = trainX, # the predictors
    y                 = trainY, # the response
    optimizer         = "adam", # using the Adam optimization method
    eval.data         = list(data=validateX, label=validateY), # validation data
    ctx               = mx.cpu(), # use the cpu for training
    eval.metric       = mx.metric.mlogloss, # evaluate with log-loss
    num.round         = 50,     # 50 epochs
    learning.rate     = 0.001,   # learning rate
    array.batch.size  = 256,    # batch size
    array.layout      = "rowmajor"  # the data is stored in row major format
)
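For a feel of what Adam is doing at each step, here is a base R sketch of the standard update rule applied to a one-dimensional toy problem. It is not MXNet's implementation; the function name, the toy objective and the larger learning rate are all chosen just for illustration.

```r
# minimize a function of one parameter given its gradient, using the Adam update
adamMinimize <- function(grad, start, learningRate=0.001,
                         beta1=0.9, beta2=0.999, eps=1e-8, steps=1000) {
    theta <- start
    m <- 0  # running average of the gradient (the momentum part)
    v <- 0  # running average of the squared gradient (the adaptive part)
    for (t in seq_len(steps)) {
        g <- grad(theta)
        m <- beta1 * m + (1 - beta1) * g
        v <- beta2 * v + (1 - beta2) * g^2
        # correct the bias from starting both averages at zero
        mHat <- m / (1 - beta1^t)
        vHat <- v / (1 - beta2^t)
        theta <- theta - learningRate * mHat / (sqrt(vHat) + eps)
    }
    theta
}

# toy objective (theta - 3)^2, whose gradient is 2 * (theta - 3)
fit <- adamMinimize(grad=function(theta) 2 * (theta - 3),
                    start=0, learningRate=0.1, steps=500)
fit
```

The estimate should settle close to 3, the minimizer of the toy objective.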

Predictions

Statisticians call this step prediction while the deep learning field calls it inference, which has an entirely different meaning in statistics.

preds_net <- predict(mod_net, testX, array.layout="rowmajor") %>% t

Elastic Net

Model Training

We fit an Elastic Net model with glmnet.

registerDoParallel(cl=4)

set.seed(1234)
mod_glmnet <- cv.glmnet(x=trainX_sparse, y=trainY, 
                        alpha=.5, family='binomial', 
                        type.measure='auc',
                        nfolds=5, parallel=TRUE)

Predictions

preds_glmnet <- predict(mod_glmnet, newx=testX_sparse, s='lambda.1se', type='response')

XGBoost

Model Training

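We fit a random forest with xgboost by growing 500 trees in parallel (num_parallel_tree=500) in a single boosting round (nrounds=1), subsampling both rows and columns.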

set.seed(1234)

trainXG <- xgb.DMatrix(data=trainX_sparse, label=trainY)
validateXG <- xgb.DMatrix(data=validateX_sparse, label=validateY)

watchlist <- list(train=trainXG, validate=validateXG)

mod_xgboost <- xgb.train(data=trainXG, 
                nrounds=1, nthread=4, 
                num_parallel_tree=500, subsample=0.5, colsample_bytree=0.5,
                objective='binary:logistic',
                eval_metric = "error", eval_metric = "logloss",
                print_every_n=1, watchlist=watchlist)
## [1]  train-error:0.104713    train-logloss:0.525032  validate-error:0.108254 validate-logloss:0.527535

Predictions

preds_xgboost <- predict(mod_xgboost, newdata=testX_sparse)

SVM

Model Training

set.seed(1234)
mod_svm <- e1071::svm(x=trainX_sparse, y=trainY, probability=TRUE, type='C')

This model did not train in a reasonable time.

Results

ROC

rocData <- dplyr::bind_rows(
    cbind(data.frame(roc(testY, preds_glmnet, direction="<")[c('specificities', 'sensitivities')]), Model='glmnet'),
    cbind(data.frame(roc(testY, preds_xgboost, direction="<")[c('specificities', 'sensitivities')]), Model='xgboost'),
    cbind(data.frame(roc(testY, preds_net, direction="<")[c('specificities', 'sensitivities')]), Model='Net')
)
ggplotly(ggplot(rocData, aes(x=specificities, y=sensitivities)) + geom_line(aes(color=Model, group=Model)) + scale_x_reverse(), width=800, height=600)
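For anyone unfamiliar with what goes into an ROC curve, here is a base R sketch on four made-up observations. It is not the pROC implementation: the helper names are invented, the curve is computed by brute force over thresholds and the AUC uses the rank (Mann-Whitney) formula.

```r
# sensitivity and specificity at each observed score used as a threshold
rocSketch <- function(labels, scores) {
    thresholds <- sort(unique(scores), decreasing=TRUE)
    data.frame(
        threshold=thresholds,
        # share of true positives scored at or above the threshold
        sensitivities=sapply(thresholds, function(th) mean(scores[labels == 1] >= th)),
        # share of true negatives scored below the threshold
        specificities=sapply(thresholds, function(th) mean(scores[labels == 0] < th))
    )
}

# area under the curve via the rank formula
aucSketch <- function(labels, scores) {
    r <- rank(scores)
    nPos <- sum(labels == 1)
    nNeg <- sum(labels == 0)
    (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# toy labels and predicted probabilities
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

rocSketch(labels, scores)
aucSketch(labels, scores)
```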

Shortly after I learned LaTeX I used it to write my resume (or CV if you will), freeing me from the headache of using Microsoft Word and the associated formatting troubles. Even that wasn’t enough though because different audiences needed different information and job listings. I could have stored all the information in the file and commented out bullet points I did not want to use, but that seemed sloppy. So instead I wrote an R package called resumer.

The trick is to store all of the data in a CSV, one row per bullet point.[1]

JobName | Company | Location | Title | Start | End | Bullet | BulletName | Type | Description
Tech Startup | Pied Piper | New York, NY | CTO | 2013 | Present | Set up company’s computing platform | 1 | Job | NA
Tech Startup | Pied Piper | New York, NY | CTO | 2013 | Present | Designed data strategy overseeing many datasources | 2 | Job | NA
Tech Startup | Pied Piper | New York, NY | CTO | 2013 | Present | Constructed statistical models for predictive analytics of big data | 3 | Job | NA
Large Bank | Goliath National Bank | New York, NY | Quant | 2011 | 2013 | Built quantitative models for derivatives trades | 1 | Job | NA
Large Bank | Goliath National Bank | New York, NY | Quant | 2011 | 2013 | Wrote algorithms using the R statistical programming language | 2 | Job | NA
Bank Intern | Goliath National Bank | New York, NY | Intern | 2010 | NA | Got coffee for senior staff | 1 | Job | NA

Each row represents a detail about a job. So a job may take multiple rows.

The columns are:

  • JobName: Name identifying this job. This is identifying information used when selecting which jobs to display.
  • Company: Name of company.
  • Location: Physical location of job.
  • Title: Title held at job.
  • Start: Start date of job, usually represented by a year.
  • End: End date of job. This would ordinarily be a year, ‘Present’ or blank.
  • Bullet: The detail about the job.
  • BulletName: Identifier for this detail, used when selecting which details to display.
  • Type: Should be either Job or Research.
  • Description: Used for a quick blurb about research roles.

There are many parts to using this package which are all explained in the README and mostly reproduced here.

The yaml header holds your name, address, the location of the jobs CSV file, education information and any highlights. Remember, proper indenting is required for yaml.

The name and address fields are self-explanatory. output takes the form of package::function which for this package is resumer::resumer.

The location of the jobs CSV is specified in the JobFile slot of the params entry. This should be the absolute path to the CSV.

These would look like this.

---
name: "Generic Name"
address: "New York"
output: resumer::resumer
params:
    JobFile: "examples/jobs.csv"
---

Supplying education information is done as a list in the education entry, with each school containing slots for school, dates and optionally notes. Each slot of the list is started with a -. The notes slot starts with a | and each line (except the last line) must end with two spaces.

For example:

---
education:
-   school: "Hudson University"
    dates: "2007--2009"
    notes: |
        GPA 3.955  
        Master of Arts in Statistics
-   school: "Smallville College"
    dates: "2000--2004"
    notes: |
        Cumulative GPA 3.838 Summa Cum Laude, Honors in Mathematics  
        Bachelor of Science in Mathematics, Journalism Minor  
        The Wayne Award for Excellence in Mathematics  
        Member of Pi Mu Epsilon, a national honorary mathematics society
---

To provide a highlights section set doHighlights: yes and create a highlights tag.

Each bullet in the highlights entry should be a list slot started by -. For example:

---
doHighlights: yes
highlights:
-   bullet: Author of \emph{Pulitzer Prize} winning article
-   bullet: Organizer of \textbf{Glasses and Cowl} Meetup
-   bullet: Analyzed global survey by the \textbf{Surveyors Inc}
-   bullet: Professor of Journalism at \textbf{Hudson University}
-   bullet: Thesis on \textbf{Facial Recognition Errors}
-   bullet: Served as reporter in \textbf{Vientiane, Laos}
---

Jobs and details are selected for display by building a list of lists named jobList. Each inner list represents a job and should have three unnamed elements:

  • CompanyName
  • JobName
  • Vector of BulletNames

An example is:

jobList <- list(
    list("Pied Piper", "Tech Startup", c(1, 3)),
    list("Goliath National Bank", "Large Bank", 1:2),
    list("Goliath National Bank", "Bank Intern", 1:3),
    list("Surveyors Inc", "Survery Stats", 1:2),
    list("Daily Planet", "Reporting", 2:4),
    list("Hudson University", "Professor", c(1, 3:4)),
    list("Hooli", "Coding Intern", c(1:3))
)
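To illustrate how an entry in jobList maps onto rows of the CSV, here is a base R sketch. It is not resumer's internal code: selectBullets is a hypothetical helper, and the toy data holds only a few columns and rows from the example table above.

```r
# a cut-down version of the jobs data
jobsToy <- data.frame(
    JobName=c('Tech Startup', 'Tech Startup', 'Tech Startup',
              'Large Bank', 'Large Bank', 'Bank Intern'),
    Company=c('Pied Piper', 'Pied Piper', 'Pied Piper',
              'Goliath National Bank', 'Goliath National Bank', 'Goliath National Bank'),
    BulletName=c(1, 2, 3, 1, 2, 1),
    stringsAsFactors=FALSE
)

# hypothetical helper: keep rows matching the company, the job and the requested bullets
selectBullets <- function(entry, jobs) {
    jobs[jobs$Company == entry[[1]] &
             jobs$JobName == entry[[2]] &
             jobs$BulletName %in% entry[[3]], ]
}

# three entries in the same shape as jobList
jobListToy <- list(
    list('Pied Piper', 'Tech Startup', c(1, 3)),
    list('Goliath National Bank', 'Large Bank', 1:2),
    list('Goliath National Bank', 'Bank Intern', 1:3)
)

# apply the helper to each entry and stack the results
selected <- do.call(rbind, lapply(jobListToy, selectBullets, jobs=jobsToy))
selected
```

Asking for bullets that do not exist, such as 2 and 3 for the intern job, simply returns nothing for them in this sketch.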

Research is specified similarly in researchList.

# generate a list of lists of research that list the company name, job name and bullet
researchList <- list(
    list("Hudson University", "Oddie Research", 4:5),
    list("Daily Planet", "Winning Article", 2)
)

The job file is read into the jobs variable using read.csv2.

library(resumer)
jobs <- read.csv2(params$JobFile, header=TRUE, sep=',', stringsAsFactors=FALSE)

The jobs and details are written to LaTeX using a code chunk with results='asis'.

Same with research details.

Regular LaTeX code can be used, such as in specifying an athletics section. Note that this uses a special rSection environment.

\begin{rSection}{Athletics}
\textbf{Ice Hockey} \emph{Goaltender} | \textbf{Hudson University} | 2000--2004 \\
\textbf{Curling} \emph{Vice Skip} | \textbf{Hudson University} | 2000--2004
\end{rSection}

A complete template is available when creating a new file in RStudio.

Any suggestions or, even better, pull requests are welcome at the GitHub page.


  1. A helper function, createJobFile, creates a CSV with the correct headers.

Snowstorm Stella impacted both our numbers and our location, but last night a smaller crew braved the cold weather and messy streets to celebrate Pi Day with pizza and Pi Cake at Ribalta.

We naturally ate a lot of round pies and even a rectangular pie to honor Hippocrates’ squaring the lune.

This year’s Pi Cake came from Empire Cakes for the third year in a row.  It was their Brooklyn Blackout cake with Chocolate frosting, a blue Pi symbol on top and blue circles with red radii around the sides.

I’m speaking in a few places over the next few weeks, so rather than just giving people a day’s notice I figured I should lay it out a bit. Right now I have three public talks lined up with a few more about to solidify. Soon I will update this map to have past talks too.


Talk | Event | City | Date
Modeling and Machine Learning in R | ODSC | San Francisco | 2017-03-01
Scraping and Analyzing NFL Data | Sloan Sports Analytics Conference | Boston | 2017-03-03
Fun with R | New York R Conference | New York | 2017-04-21


Highlights from the 2016 New York R Conference

Originally posted on www.work-bench.com.

You might be asking yourself, “How was the 2016 New York R Conference?”

Well, if we had to sum it up in one picture, it would look a lot like this (thank you to Drew Conway for the slide & delivering the battle cry for data science in NYC):

[Image: Drew Conway's slide]

Our 2nd annual, sold-out New York R Conference was back this year on April 8th & 9th at Work-Bench. Co-hosted with our friends at Lander Analytics, this year’s conference was bigger and better than ever, with over 250 attendees, and speakers from Airbnb, AT&T, Columbia University, eBay, Etsy, RStudio, Socure, and Tamr. In case you missed the conference or want to relive the excitement, all of the talks and slides are now live on the R Conference website.

With 30 talks, each 20 minutes long, and two 40-minute keynotes, the topics of the presentations were just as diverse as the speakers. Vivian Peng gave an emotional talk on data visualization using non-visual senses and “The Feels.” Bryan Lewis measured the shadows of audience members to demonstrate the pros and cons of projection methods, and Daniel Lee talked about life, love, Stan, and March Madness. But, even with 32 presentations from a diverse selection of speakers, two dominant themes emerged: 1) Community and 2) Writing better code.

Given the amazing caliber of speakers and attendees, community was on everyone’s mind from the start. Drew Conway emoted the past, present, and future of data science in NYC, and spoke to the dangers of tearing down the tent we built. Joe Rickert from Microsoft discussed the R Consortium and how to become involved. Wes McKinney talked about community efforts in improving interoperability between data science languages with the new Feather data frame file format under the Apache Arrow project. Elena Grewal discussed how Airbnb’s data science team made changes to the hiring process to increase the number of female hires, and Andrew Gelman even talked about how your political opinions are shaped by those around you in his talk about Social Penumbras.

Writing better code also proved to be a dominant theme throughout the two-day conference. Dan Chen of Lander Analytics talked about implementing tests in R. Similarly, Neal Richardson and Mike Malecki of Crunch.io talked about how they learned to stop munging and love tests, and Ben Lerner discussed how to optimize Python code using profilers and Cython. The perfect intersection of themes came from Bas van Schaik of Semmle, who discussed how to use data science to write better code by treating code as data. While everyone had some amazing insights, these were our top five highlights:

JJ Allaire Releases a New Preview of RStudio

JJ Allaire, the second speaker of the conference, got the crowd fired up by announcing new features of RStudio and new packages. Particularly exciting were bookdown for authoring large documents, R Notebooks for interactive Markdown files, and shared sessions so multiple people can code together from separate computers.

Andrew Gelman Discusses the Political Impact of the Social Penumbra

As always, Dr. Andrew Gelman wowed the crowd with his breakdown of how political opinions are shaped by those around us. He utilized his trademark visualizations and wit to convey the findings of complex models.

Vivian Peng Helps Kick off the Second Day with a Punch to the Gut

On the morning of the second day of the conference, Vivian Peng gave a heartfelt talk on using data visualization and non-visual senses to drive emotional reaction and shape public opinion on everything from the Syrian civil war to drug resistance statistics.

Ivor Cribben Studies Brain Activity with Time Varying Networks

University of Alberta Professor Ivor Cribben demonstrated his techniques for analyzing fMRI data. His use of network graphs, time series, and extremograms brought academic rigor to the conference.

Elena Grewal Talks About Scaling Data Science at Airbnb

After two jam-packed days, Elena Grewal helped wind down the conference with a thoughtful introspection on how Airbnb has grown its data science team from 5 to 70 people, with a focus on increasing diversity and eliminating bias in the hiring process.

See the full conference videos & presentations below, and sign up for updates for the 2017 New York R Conference on www.rstats.nyc. To get your R fix in the meantime, follow @nyhackr, @Work_Bench, and @rstatsnyc on Twitter, and check out the New York Open Statistical Programming Meetup or one of Work-Bench’s upcoming events!

Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences; and author of R for Everyone.

Ohio State Buckeyes defensive lineman Joey Bosa

It has been a busy few weeks with the New York R Conference, speaking engagements, writing the second edition of R for Everyone and coding open source packages. The most exciting news involves the news, as the Wall Street Journal wrote an article about my NFL Draft work.

It is a great piece, with some nice quotes from Vikings General Manager Rick Spielman and ESPN’s legendary John Clayton, that succinctly sums up the work I did and runs the numbers on a few select players.

So now I’ve been in the news for pizza, the lottery and football.  Fun mix.

MIT Sports Analytics Conference

Last year, as I embarked on my NFL sports statistics work, I attended the Sloan Sports Analytics Conference for the first time. A year later, after a very successful draft, I was invited to present an R workshop at the conference.

My time slot was up against Nate Silver, so I didn’t expect many people to attend. Much to my surprise, when I entered the room every seat was taken, and people were lining the walls and sitting in the aisles.

My presentation, which was unrelated to the work I did, analyzed the Giants’ probability of passing versus rushing and the probability of which receiver was targeted. It is available at the talks section of my site.

After the talk I spent the rest of the day fielding questions and gave away copies of R for Everyone and an NYC Data Mafia shirt.
