The biggest event for me this year was completely outside of work and had nothing to do with statistics or R: I got married. We technically met at the Open Stats meetup and I did build our wedding website with RMarkdown, so R was still involved. We just returned from our around-the-world honeymoon so I thought the best way to track our travels would be with maps and globes using leaflet and threejs.

Before we get to any code, the following packages were used in making this post.

This was an extensive trip that, in addition to traditional vacation activities, included a few visits to clients and speaking and a few conferences and meetups. In all, we visited, London, Singapore, Hong Kong, Auckland, Queenstown, Bora Bora, Tahiti, Moorea, San Jose and San Francisco, with a connection or two in between.

The airport/ferry codes for our trip were the following.

Origin	Destination	Airline
JFK	LGW	Norwegian Air
LHR	SIN	Singapore Airlines
SIN	HKG	Singapore Airlines
HKG	AKL	Cathay Pacific
AKL	ZQN	Air New Zealand
ZQN	AKL	Air New Zealand
AKL	PPT	Air New Zealand
PPT	BOB	Air Tahiti
BOB	MOZ	Air Tahiti
MOZ	PPT	Terevau
PPT	LAX	Air Tahiti Nui
LAX	SJC	Alaska Airlines
SFO	JFK	JetBlue

Converting these to latitude and longitude is easy thanks to Open Flights.

# read in the data
airports <- readr::read_csv('https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports-extended.dat',
                            # give it good column names since the data are headerless
                            col_names=c('ID', 'Name', 'City', 'Country', 
                                        'IATA', 'ICAO', 'Latitude', 'Longitude', 
                                        'Altitude', 'Timezone', 'DST', 'Tz', 
                                        'Type', 'Source'))

We then use filter to get just the ports we visited. Notice how we use a second tbl inside filter.

visited <- airports %>% 
    select(Name, City, Country, IATA, Latitude, Longitude) %>% 
    filter(IATA %in% (codes %>% select(Origin, Destination) %>% unlist))

DT::datatable(visited, elementId='AirportsTable',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  dom='<"top"f>rt<"bottom"i><"clear">'
                  ,
                  scrollY=200,
                  scroller=TRUE
              )
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

We then manually reorder the airports so that edges can be drawn nicely between them. This is akin to creating an edgelist of airport-pairs. This is not the most robust way of creating this list, but suffices for our purposes.

visitedOrdered <- visited %>% 
    slice(c(12, 1, 2, 8, 7, 5, 6, 5, 13, 3, 4, 13, 10, 11, 12))

For the first visualization let’s create a map using leaflet.

# initialize the widget
leaflet(data=visitedOrdered) %>% 
    # overlay map tiles
    addTiles() %>% 
    # plot lines from one point to another
    addPolylines(lng=~Longitude, lat=~Latitude) %>% 
    # add markers with city names
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~City)

Unfortunately, this doesn’t quite capture the directions of the flights as it makes it look like we flew back west to get to Papeete. So let’s try a globe instead using threejs.

We augment the edgelist of airports so that it has the latitude and longitude of the origin and destination airports for each flight.

flightPaths <- codes %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Origin'='IATA')) %>% 
    rename(oLong=Longitude, oLat=Latitude) %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Destination'='IATA')) %>% 
    rename(dLong=Longitude, dLat=Latitude)

DT::datatable(visited, elementId='FlightPathLatLong',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  dom='<"top"f>rt<"bottom"i><"clear">'
                  ,
                  scrollY=200,
                  scroller=TRUE
              )
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

Now we can provide that data to threejs. We first specify an image to overlay on the globe. Then we specify the latitude and longitude of visited airports. After that, we provide the origin and destination latitudes and longitudes of our flights. The rest of the arguments are cosmetic.

globejs(
    # the image to overlay on the globe
    img="http://eoimages.gsfc.nasa.gov/images/imagerecords/73000/73909/world.topo.bathy.200412.3x5400x2700.jpg",
    # lat/long of visited airports
    lat=visited$Latitude, long=visited$Longitude,
    # lat/long of origin and destination
    arcs=flightPaths %>% select(oLat, oLong, dLat, dLong),
    # cosmetic adjustments
    arcsHeight=.4, arcsLwd=7, arcsColor="red", arcsOpacity=.95,
    atmosphere=FALSE, fov=30, rotationlat=0.3, rotationlong=.8*pi)

We now calculate the total distance traveled (not including car trips) using Haversine Distance to account for the curvature of the Earth.

distHaversine(visitedOrdered %>% select(Longitude, Latitude), r=3959) %>% sum

## [1] 28660.52

So we traveled 3,760 more miles than the circumference of the Earth.

Beyond the epic proportions of our travel, this honeymoon was outstanding from the sheer length, to the vastly different places we visited, to the food we ate and the sights we saw, to the activities we participated in, to the great people along the way. And, of course, it’s amazing to spend a month traveling with your favorite person.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

This is the code for a webinar I gave with Dan Mbanga for Amazon’s AWS Webinar Series about deep learning using MXNet in R. The video is available on YouTube.

I am experimentally saving htmlwidget objects to HTML files then loading them with iframes so please excuse the scroll bars.

Packages

These are the packages we are using either by loading the entire package or using individual functions.

Data

This dataset is about property lots in Manhattan and includes descriptive information as well as value. The original data are available from NYC Planning and the prepared files seen here at the Lander Analytics data.world repo.

data_train <- readr::read_csv(file.path(dataDir, 'manhattan_Train.csv'))
data_validate <- readr::read_csv(file.path(dataDir, 'manhattan_Validate.csv'))
data_test <- readr::read_csv(file.path(dataDir, 'manhattan_Test.csv'))

# Remove some variables
data_train <- data_train %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_validate <- data_validate %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_test <- data_test %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)

This is a glimpse of the data.

Here is a visualization of the class balance.

dataList <- list(data_train, data_validate, data_test)
dataList %>% 
    purrr::map(function(x) figure(width=600/NROW(dataList), height=500, legend_location=NULL) %>% 
                   ly_bar(x=High, color=factor(High), data=x, legend='code')
    ) %>% 
    grid_plot(nrow=1, ncol=NROW(dataList), same_axes=TRUE)

We use vtreat to do some automated feature engineering.

# The column name for the response
responseName <- 'High'
# The target value for the response
responseTarget <- TRUE
# The remaining columns are predictors
varNames <- setdiff(names(data_train), responseName)

# build the treatment design
treatmentDesign <- designTreatmentsC(dframe=data_train, varlist=varNames, 
                                     outcomename=responseName, 
                                     outcometarget=responseTarget, 
                                     verbose=TRUE)

## [1] "desigining treatments Mon Jun 26 01:23:38 2017"
## [1] "designing treatments Mon Jun 26 01:23:38 2017"
## [1] " have level statistics Mon Jun 26 01:23:39 2017"
## [1] "design var FireService Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist1 Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist2 Mon Jun 26 01:23:40 2017"
## [1] "design var ZoneDist3 Mon Jun 26 01:23:40 2017"
## [1] "design var Class Mon Jun 26 01:23:41 2017"
## [1] "design var LandUse Mon Jun 26 01:23:42 2017"
## [1] "design var Easements Mon Jun 26 01:23:43 2017"
## [1] "design var OwnerType Mon Jun 26 01:23:43 2017"
## [1] "design var LotArea Mon Jun 26 01:23:43 2017"
## [1] "design var BldgArea Mon Jun 26 01:23:43 2017"
## [1] "design var ComArea Mon Jun 26 01:23:43 2017"
## [1] "design var ResArea Mon Jun 26 01:23:44 2017"
## [1] "design var OfficeArea Mon Jun 26 01:23:44 2017"
## [1] "design var RetailArea Mon Jun 26 01:23:44 2017"
## [1] "design var GarageArea Mon Jun 26 01:23:44 2017"
## [1] "design var StrgeArea Mon Jun 26 01:23:44 2017"
## [1] "design var FactryArea Mon Jun 26 01:23:44 2017"
## [1] "design var OtherArea Mon Jun 26 01:23:44 2017"
## [1] "design var NumBldgs Mon Jun 26 01:23:44 2017"
## [1] "design var NumFloors Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsRes Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsTotal Mon Jun 26 01:23:44 2017"
## [1] "design var LotFront Mon Jun 26 01:23:44 2017"
## [1] "design var LotDepth Mon Jun 26 01:23:44 2017"
## [1] "design var BldgFront Mon Jun 26 01:23:44 2017"
## [1] "design var BldgDepth Mon Jun 26 01:23:45 2017"
## [1] "design var Extension Mon Jun 26 01:23:45 2017"
## [1] "design var Proximity Mon Jun 26 01:23:45 2017"
## [1] "design var IrregularLot Mon Jun 26 01:23:45 2017"
## [1] "design var LotType Mon Jun 26 01:23:46 2017"
## [1] "design var BasementType Mon Jun 26 01:23:46 2017"
## [1] "design var Landmark Mon Jun 26 01:23:47 2017"
## [1] "design var BuiltFAR Mon Jun 26 01:23:47 2017"
## [1] "design var ResidFAR Mon Jun 26 01:23:47 2017"
## [1] "design var CommFAR Mon Jun 26 01:23:47 2017"
## [1] "design var FacilFAR Mon Jun 26 01:23:47 2017"
## [1] "design var Built Mon Jun 26 01:23:47 2017"
## [1] "design var HistoricDistrict Mon Jun 26 01:23:48 2017"
## [1] " scoring treatments Mon Jun 26 01:23:48 2017"
## [1] "have treatment plan Mon Jun 26 01:24:19 2017"
## [1] "rescoring complex variables Mon Jun 26 01:24:19 2017"
## [1] "done rescoring complex variables Mon Jun 26 01:24:32 2017"

Then we create train, validate and test matrices.

# build design data.frames
dataTrain <- prepare(treatmentplan=treatmentDesign, dframe=data_train)
dataValidate <- prepare(treatmentplan=treatmentDesign, dframe=data_validate)
dataTest <- prepare(treatmentplan=treatmentDesign, dframe=data_test)

# use all the level names as predictors
predictorNames <- setdiff(names(dataTrain), responseName)

# training matrices
trainX <- data.matrix(dataTrain[, predictorNames])
trainY <- dataTrain[, responseName]

# validation matrices
validateX <- data.matrix(dataValidate[, predictorNames])
validateY <- dataValidate[, responseName]

# test matrices
testX <- data.matrix(dataTest[, predictorNames])
testY <- dataTest[, responseName]

# Sparse versions for some models
trainX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTrain)
validateX_sparse <- sparse.model.matrix(object=High ~ ., data=dataValidate)
testX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTest)

Feedforward Network

Helper Functions

This is a function that allows mxnet to calculate log-loss based on the logloss function from the Metrics package.

# log-loss
mx.metric.mlogloss <- mx.metric.custom("mlogloss", function(label, pred){
    return(Metrics::logLoss(label, pred))
})

Network Formulation

We build the model symbolically. We use a feedforward network with two hidden layers. The first hidden layer has 256 units and the second has 128 units. We also use dropout and batch normalization for regularization. The last step is to use a logistic sigmoid (inverse logit) for the logistic regression output.

net <- mx.symbol.Variable('data') %>%
    # drop out 20% of predictors
    mx.symbol.Dropout(p=0.2, name='Predictor_Dropout') %>%
    # a fully connected layer with 256 units
    mx.symbol.FullyConnected(num_hidden=256, name='fc_1') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_1') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_1') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_1') %>%
    # a fully connected layer with 128 units
    mx.symbol.FullyConnected(num_hidden=128, name='fc_2') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_2') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_2') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_2') %>%
    # fully connect to the output layer which has just the 1 unit
    mx.symbol.FullyConnected(num_hidden=1, name='out') %>%
    # use the sigmoid output
    mx.symbol.LogisticRegressionOutput(name='output')

Inspect the Network

By inspecting the symbolic network we see that it is actually just a C++ pointer. We also see its arguments and a visualization.

net

## C++ object <0000000018ed9aa0> of class 'MXSymbol' <0000000018c30d00>

arguments(net)

##  [1] "data"         "fc_1_weight"  "fc_1_bias"    "bn_1_gamma"  
##  [5] "bn_1_beta"    "fc_2_weight"  "fc_2_bias"    "bn_2_gamma"  
##  [9] "bn_2_beta"    "out_weight"   "out_bias"     "output_label"

graph.viz(net)

Network Training

With the data prepared and the network specified we now train the model. First we set the envinronment variable MXNET_CPU_WORKER_NTHREADS=4 since this demo is on a laptop with four threads. Using a GPU will speed up the computations. We also set the random seed with mx.set.seed for reproducibility.

We use the Adam optimization algorithm which has an adaptive learning rate which incorporates momentum.

# use four CPU threads
Sys.setenv('MXNET_CPU_WORKER_NTHREADS'=4)

# set the random seed
mx.set.seed(1234)

# train the model
mod_net <- mx.model.FeedForward.create(
    symbol            = net,    # the symbolic network
    X                 = trainX, # the predictors
    y                 = trainY, # the response
    optimizer         = "adam", # using the Adam optimization method
    eval.data         = list(data=validateX, label=validateY), # validation data
    ctx               = mx.cpu(), # use the cpu for training
    eval.metric       = mx.metric.mlogloss, # evaluate with log-loss
    num.round         = 50,     # 50 epochs
    learning.rate     = 0.001,   # learning rate
    array.batch.size  = 256,    # batch size
    array.layout      = "rowmajor"  # the data is stored in row major format
)

Predictions

Statisticians call this step prediction while the deep learning field calls it inference which has an entirely different meaning in statistics.

preds_net <- predict(mod_net, testX, array.layout="rowmajor") %>% t

Elastic Net

Model Training

We fit an Elastic Net model with glmnet.

registerDoParallel(cl=4)

set.seed(1234)
mod_glmnet <- cv.glmnet(x=trainX_sparse, y=trainY, 
                        alpha=.5, family='binomial', 
                        type.measure='auc',
                        nfolds=5, parallel=TRUE)

Predictions

preds_glmnet <- predict(mod_glmnet, newx=testX_sparse, s='lambda.1se', type='response')

XGBoost

Model Training

We fit a random forest with xgboost.

set.seed(1234)

trainXG <- xgb.DMatrix(data=trainX_sparse, label=trainY)
validateXG <- xgb.DMatrix(data=validateX_sparse, label=validateY)

watchlist <- list(train=trainXG, validate=validateXG)

mod_xgboost <- xgb.train(data=trainXG, 
                nrounds=1, nthread=4, 
                num_parallel_tree=500, subsample=0.5, colsample_bytree=0.5,
                objective='binary:logistic',
                eval_metric = "error", eval_metric = "logloss",
                print_every_n=1, watchlist=watchlist)

## [1]  train-error:0.104713    train-logloss:0.525032  validate-error:0.108254 validate-logloss:0.527535

Predictions

preds_xgboost <- predict(mod_xgboost, newdata=testX_sparse)

SVM

Model Training

set.seed(1234)
mod_svm <- e1071::svm(x=trainX_sparse, y=trainY, probability=TRUE, type='C')

This model did not train in a reasonable time.

Results

ROC

rocData <- dplyr::bind_rows(
    cbind(data.frame(roc(testY, preds_glmnet, direction="<")[c('specificities', 'sensitivities')]), Model='glmnet'),
    cbind(data.frame(roc(testY, preds_xgboost, direction="<")[c('specificities', 'sensitivities')]), Model='xgboost'),
    cbind(data.frame(roc(testY, preds_net, direction="<")[c('specificities', 'sensitivities')]), Model='Net')
)

ggplotly(ggplot(rocData, aes(x=specificities, y=sensitivities)) + geom_line(aes(color=Model, group=Model)) + scale_x_reverse(), width=800, height=600)

Shortly after I learned LaTeX I used it to write my resume (or CV if you will), freeing me from the headache of using Microsoft Word and the associated formatting troubles. Even that wasn’t enough though because different audiences needed different information and job listings. I could have stored all the information in the file and commented out bullet points I did not want to use, but that seemed sloppy. So instead I wrote an R package called resumer.

The trick is to store all of the data in a CSV, one row per bullet point.¹

JobName	Company	Location	Title	Start	End	Bullet	BulletName	Type	Description
Tech Startup	Pied Piper	New York, NY	CTO	2013	Present	Set up company’s computing platform	1	Job	NA
Tech Startup	Pied Piper	New York, NY	CTO	2013	Present	Designed data strategy overseeing many datasources	2	Job	NA
Tech Startup	Pied Piper	New York, NY	CTO	2013	Present	Constructed statistical models for predictive analytics of big data	3	Job	NA
Large Bank	Goliath National Bank	New York, NY	Quant	2011	2013	Built quantitative models for derivatives trades	1	Job	NA
Large Bank	Goliath National Bank	New York, NY	Quant	2011	2013	Wrote algorithms using the R statistical programming language	2	Job	NA
Bank Intern	Goliath National Bank	New York, NY	Intern	2010	NA	Got coffee for senior staff	1	Job	NA

Each row represents a detail about a job. So a job may take multiple rows.

The columns are:

JobName: Name identifying this job. This is identifying information used when selecting which jobs to display.
Company: Name of company.
Location: Physical location of job.
Title: Title held at job.
Start: Start date of job, usually represented by a year.
End: End date of job. This would ordinarily by a year, ‘Present’ or blank.
Bullet: The detail about the job.
BulletName: Identifier for this detail, used when selecting which details to display.
Type: Should be either Job or Research.
Description: Used for a quick blurb about research roles.

There are many parts to using this package which are all explained in the README and mostly reproduced here.

The yaml header holds your name, address, the location of the jobs CSV file, education information and any highlights. Remember, proper indenting is required for yaml.

The name and address fields are self explanatory. output takes the form of package::function which for this package is resumer::resumer.

The location of the jobs CSV is specified in the JobFile slot of the params entry. This should be the absolute path to the CSV.

These would look like this.

---
name: "Generic Name"
address: "New York"
output: resumer::resumer
params:
    JobFile: "examples/jobs.csv"
---

Supplying education information is done as a list in the education entry, with each school containing slots for school, dates and optionally notes. Each slot of the list is started with a -. The notes slot starts with a | and each line (except the last line) must end with two spaces.

For example:

---
education:
-   school: "Hudson University"
    dates: "2007--2009"
    notes: |
        GPA 3.955  
        Master of Arts in Statistics
-   school: "Smallville College"
    dates: "2000--2004"
    notes: |
        Cumulative GPA 3.838 Summa Cum Laude, Honors in Mathematics  
        Bachelor of Science in Mathematics, Journalism Minor  
        The Wayne Award for Excellence in Mathematics  
        Member of Pi Mu Epsilon, a national honorary mathematics society
---

To provide a highlights section set doHighlights: yes and create a highlights tag.

Each bullet in the highlights entry should be a list slot started by -. For example.

---
doHighlights: yes
highlights:
-   bullet: Author of \emph{Pulitzer Prize} winning article
-   bullet: Organizer of \textbf{Glasses and Cowl} Meetup
-   bullet: Analyzed global survey by the \textbf{Surveyors Inc}
-   bullet: Professor of Journalism at \textbf{Hudson University}
-   bullet: Thesis on \textbf{Facial Recognition Errors}
-   bullet: Served as reporter in \textbf{Vientiane, Laos}
---

Jobs and details are selected for display by building a list of lists named jobList. Each inner list represents a job and should have three unnamed elements: – CompanyName – JobName – Vector of BulletNames

An example is:

jobList <- list(
    list("Pied Piper", "Tech Startup", c(1, 3)),
    list("Goliath National Bank", "Large Bank", 1:2),
    list("Goliath National Bank", "Bank Intern", 1:3),
    list("Surveyors Inc", "Survery Stats", 1:2),
    list("Daily Planet", "Reporting", 2:4),
    list("Hudson University", "Professor", c(1, 3:4)),
    list("Hooli", "Coding Intern", c(1:3))
)

Research is specified similarly in researchList.

# generate a list of lists of research that list the company name, job name and bullet
researchList <- list(
    list("Hudson University", "Oddie Research", 4:5),
    list("Daily Planet", "Winning Article", 2)
)

The job file is read into the jobs variable using read.csv2.

library(resumer)
jobs <- read.csv2(params$JobFile, header=TRUE, sep=',', stringsAsFactors=FALSE)

The jobs and details are written to LaTeX using a code chunk with results='asis'.

Same with research details.

Regular LaTeX code can be used, such as in specifying an athletics section. Note that this uses a special rSection environment.

\begin{rSection}{Athletics}
\textbf{Ice Hockey} \emph{Goaltender} | \textbf{Hudson University} | 2000--2004 \\
\textbf{Curling} \emph{Vice Skip} | \textbf{Hudson University} | 2000--2004
\end{rSection}

A complete template is available when creating a new file in RStudio.

Any suggestions or, even better, pull requests are welcome at the GitHub page.

A helper function, createJobFile, creates a CSV with the correct headers.↩

Snowstorm Stella impacted both our numbers and our location, but last night a smaller crew braved the cold weather and messy streets to celebrate Pi Day with pizza and Pi Cake at Ribalta.

We naturally ate a lot of round pies and even a rectangular pie to honor Hippocrates’ squaring the lune.

This year’s Pi Cake came from Empire Cakes for the third year in a row. It was their Brooklyn Blackout cake with Chocolate frosting, a blue Pi symbol on top and blue circles with red radii around the sides.

Some pictures from last night:

And all the years’ Pi Cakes:

[Show slideshow]

I’m speaking in a few places over the next few weeks, so rather than just giving people a day’s notice I figured I should lay it out a bit. Right now I have three public talks lined up with a few more about to solidify. Soon I will update this map to have past talks too.

Talk	Event	City	Date
Modeling and Machine Learning in R	ODSC	San Francisco	2017-03-01
Scraping and Analyzing NFL Data	Sloan Sports Analytics Conference	Boston	2017-03-03
Fun with R	New York R Conference	New York	2017-04-21

Highlights from the 2016 New York R Conference

Originally posted on www.work-bench.com.

You might be asking yourself, “How was the 2016 New York R Conference?”

Well, if we had to sum it up in one picture, it would look a lot like this (thank you to Drew Conway for the slide & delivering the battle cry for data science in NYC):

Our 2nd annual, sold-out New York R Conference was back this year on April 8th & 9th at Work-Bench. Co-hosted with our friends at Lander Analytics, this year’s conference was bigger and better than ever, with over 250 attendees, and speakers from Airbnb, AT&T, Columbia University, eBay, Etsy, RStudio, Socure, and Tamr. In case you missed the conference or want to relive the excitement, all of the talks and slides are now live on the R Conference website.

With 30 talks, each 20 minutes long and two forty-minute keynotes, the topics of the presentations were just as diverse as the speakers. Vivian Peng gave an emotional talk on data visualization using non-visual senses and “The Feels.” Bryan Lewis measured the shadows of audience members to demonstrate the pros and cons of projection methods, and Daniel Lee talked about life, love, Stan, and March Madness. But, even with 32 presentations from a diverse selection of speakers, two dominant themes emerged: 1) Community and 2) Writing better code.

Given the amazing caliber of speakers and attendees, community was on everyone’s mind from the start. Drew Conway emoted the past, present, and future of data science in NYC, and spoke to the dangers of tearing down the tent we built. Joe Rickert from Microsoft discussed the R Consortium and how to become involved. Wes McKinney talked about community efforts in improving interoperability between data science languages with the new Feather data frame file format under the Apache Arrow project. Elena Grewal discussed how Airbnb’s data science team made changes to the hiring process to increase the number of female hires, and Andrew Gelman even talked about how your political opinions are shaped by those around you in his talk about Social Penumbras.

Writing better code also proved to be a dominant theme throughout the two day conference. Dan Chen of Lander Analytics talked about implementing tests in R. Similarly, Neal Richardson and Mike Malecki of Crunch.io talked about how they learned to stop munging and love tests, and Ben Lerner discussed how to optimize Python code using profilers and Cython. The perfect intersection of themes came from Bas van Schaik of Semmle who discussed how to use data science to write better code by treating code as data. While everyone had some amazing insights, these were our top five highlights:

JJ Allaire Releases a New Preview of RStudio

JJ Allaire, the second speaker of the conference, got the crowd fired up by announcing new features of RStudio and new packages. Particularly exciting was bookdown for authoring large documents, R Notebooks for interactive Markdown files and shared sessions so multiple people can code together from separate computers.

As always, Dr. Andrew Gelman wowed the crowd with his breakdown of how political opinions are shaped by those around us. He utilized his trademark visualizations and wit to convey the findings of complex models.

Vivian Peng Helps Kick off the Second Day with a Punch to the Gut

On the morning of the second day of the conference, Vivian Peng gave a heartfelt talk on using data visualization and non-visual senses to drive emotional reaction and shape public opinion on everything from the Syrian civil war to drug resistance statistics.

Ivor Cribben Studies Brain Activity with Time Varying Networks

University of Alberta Professor Ivor Cribben demonstrated his techniques for analyzing fMRI data. His use of network graphs, time series and extremograms brought an academic rigor to the conference.

Elena Grewal Talks About Scaling Data Science at Airbnb

After a jam-packed 2 full days, Elena Grewal helped wind down the conference with a thoughtful introspection on how Airbnb has grown their data science team from 5 to 70 people, with a focus on increasing diversity and eliminating bias in the hiring process.

See the full conference videos & presentations below, and sign up for updates for the 2017 New York R Conference on www.rstats.nyc. To get your R fix in the meantime, follow @nyhackr, @Work_Bench, and @rstatsnyc on Twitter, and check out the New York Open Programming Statistical Meetup or one of Work-Bench’s upcoming events!

Ohio State Buckeyes defensive lineman Joey Bosa

Been a busy few weeks with the New York R Conference, speaking engagements, writing the second edition of R for Everyone and coding open source packages. The most exciting news involves the news as the Wall Street Journal wrote an article about my NFL Draft work.

It is a great piece with some nice quotes from the Vikings General Manager Rick Spielman and ESPN’s legendary John Clayton that succinctly sums up the work I did and runs the numbers on a few select players.

So now I’ve been in the news for pizza, the lottery and football. Fun mix.

Last year, as I embarked on my NFL sports statistics work, I attended the Sloan Sports Analytics Conference for the first time. A year later, after a very successful draft, I was invited to present an R workshop to the conference.

My time slot was up against Nate Silver so I didn’t expect many people to attend. Much to my surprise when I entered the room every seat was taken, people were lining the walls and sitting in the aisles.

My presentation, which was unrelated to the work I did, analyzed the Giants’ probability of passing versus rushing and the probability of which receiver was targeted. It is available at the talks section of my site.

After the talk I spent the rest of the day fielding questions and gave away copies of R for Everyone and an NYC Data Mafia shirt.

Last night we celebrated Rounded Pi Day by rounding at the 10,000’s digit to get 3.1416 which nicely works with the date 3/14/16. This was great after Mega Pi Day worked out so perfectly last year. And this all built upon previous years’ celebrations.

We ate a large quantity of pizza at Lombardi’s. and for the second year in a row we got the Pi Cake from Empire Cakes with peanut butter and chocolate flavors. The base was inscribed with historic approximations of Pi: 25/8, 256/81, 339/108, 223/71, 377/120, 3927/1250, 355/113, 62832/20000, 22/7.

Some pictures from the fantastic night:

Previous year’s Pi Cakes:

[Show slideshow]

Related Posts

Packages

Data

Feedforward Network

Helper Functions

Network Formulation

Inspect the Network

Network Training

Predictions

Elastic Net

Model Training

Predictions

XGBoost

Model Training

Predictions

SVM

Model Training

Results

ROC

Related Posts

Related Posts

Related Posts

Related Posts

Highlights from the 2016 New York R Conference

JJ Allaire Releases a New Preview of RStudio

Andrew Gelman Discusses the Political Impact of the Social Penumbra

Vivian Peng Helps Kick off the Second Day with a Punch to the Gut

Ivor Cribben Studies Brain Activity with Time Varying Networks

Elena Grewal Talks About Scaling Data Science at Airbnb

Related Posts

Related Posts

Related Posts

Related Posts