So far this year I have logged many miles in the air and on the rails. In between trips to Minneapolis and Boston I spent about a month traveling through India and Southeast Asia, mainly to conduct R courses in Singapore and Kuala Lumpur for the likes of Intel, Micron, Celcom, Maxis, DBS and other similar companies. The training courses were organized through Revolution Analytics’ Singapore office. Given the success of the classes, there will be more opportunities this spring or summer in Singapore, Kuala Lumpur and also in Australia.
Quite a lot of material was covered, based on the offerings of my company, Lander Analytics, and the content of my book, R for Everyone.
Day 1 – Basics
- Getting and installing R
- The RStudio Environment
- The basics of R
  - Variables
  - Data Types
  - Reading data
  - Calling functions
  - Missing Data
  - Basic Math
- Advanced Data Structures
  - data.frames
  - lists
  - matrices
  - arrays
- Reading Data into R
  - read.table
  - RODBC
  - Binary data
- Matrix Calculations
- Data Munging
- Writing functions
- Conditionals
- Loops
- String manipulation and regular expressions
- Visualization
  - Base R
  - ggplot2
Day 2 – Modeling
- Basic Statistics
- Probability Distributions
- Averages, standard deviations and correlations
- t-test
- Linear Models
- Generalized Linear Models
  - Logistic Regression
  - Poisson Regression
- Survival Analysis
- Assessing Model Quality
  - MSE
  - AIC
  - BIC
  - Residual Analysis
- Time Series
- Variable Selection
Day 3 – Machine Learning
- Variable selection for high dimensional data with glmnet
- Reduce uncertainty with weakly informative priors and Bayesian regression
- K-Means clustering
- Hierarchical clustering
- Multidimensional scaling
- Decision Trees for classification
- Random Forests for ensembling decision trees
- Bootstrap for measuring uncertainty
- Cross validation for model assessment
- Support Vector Machines
- Neural Networks
Day 4 – Data Presentation and Portability
- Reproducible reports using knitr
- Basic Introduction to Markdown
- Using knitr to automatically generate reports with embedded analytics
- Using Markdown and knitr to automatically generate websites with embedded analytics
- Using Markdown and knitr to make HTML5 slideshows with embedded analytics
- Advanced plotting
- Building R Packages
- Shiny Overview
Day 5 – High Performance Computing with R
- Benchmarking code using microbenchmark
- The different speeds of various aggregation functions
  - aggregate
  - tapply
  - plyr
  - data.table
- Fast manipulation using dplyr
- Running dplyr commands in a database
- Parallel Code
- Integrating C++
Given my extensive time abroad I thought it would be good to look at it all on a map using the Leaflet package in R.
Using the Google Maps API we can look up the latitude and longitude of the visited cities.
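library(XML)
library(plyr)

# Cities visited on the Asia leg of the trip
cities <- c('Hong Kong', 'Haripal, India', 'Kolkata, India', 'Jaipur, India', 'Agra, India', 'Delhi, India',
            'Singapore', 'Kuala Lumpur, Malaysia', 'George Town, Malaysia')

# Look up one place and return a one-row data.frame with its coordinates.
# Note: Google now requires HTTPS and an API key for its geocoding service.
lat.long <- function(place)
{
    # Percent-encode the name so spaces and commas are safe in the URL
    theURL <- sprintf('http://maps.google.com/maps/api/geocode/xml?sensor=false&address=%s',
                      URLencode(place, reserved=TRUE))
    doc <- xmlToList(theURL)
    data.frame(Place=place,
               Latitude=as.numeric(doc$result$geometry$location$lat),
               Longitude=as.numeric(doc$result$geometry$location$lng),
               stringsAsFactors=FALSE)
}

# Geocode every city; adply prepends an index column, which is dropped for display
places <- adply(cities, 1, lat.long)
knitr::kable(places[, -1], digits=3, row.names=FALSE)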
Place | Latitude | Longitude |
---|---|---|
Hong Kong | 22.396 | 114.109 |
Haripal, India | 22.817 | 88.105 |
Kolkata, India | 22.573 | 88.364 |
Jaipur, India | 26.912 | 75.787 |
Agra, India | 27.177 | 78.008 |
Delhi, India | 28.614 | 77.209 |
Singapore | 1.352 | 103.820 |
Kuala Lumpur, Malaysia | 3.139 | 101.687 |
George Town, Malaysia | 5.415 | 100.330 |
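Place names such as 'Kuala Lumpur, Malaysia' contain spaces and a comma, so it is generally safer to percent-encode them before they go into the query string. The snippet below is a minimal sketch of that step only; it makes no network call. `buildGeocodeURL` is a hypothetical helper name introduced for illustration, and the endpoint is simply the one shown in the code above.

```r
# Hypothetical helper: build the geocoding request URL for one place name.
# URLencode() comes from the utils package, which R attaches by default.
buildGeocodeURL <- function(place)
{
    sprintf('http://maps.google.com/maps/api/geocode/xml?sensor=false&address=%s',
            URLencode(place, reserved=TRUE))
}

# Spaces become %20 and the comma becomes %2C
exampleURL <- buildGeocodeURL('Kuala Lumpur, Malaysia')
exampleURL
```

Passing `reserved=TRUE` matters here because a comma is a reserved character that `URLencode` leaves untouched by default.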
Now that we have the coordinates we use Leaflet to plot them.
library(leaflet)
# Plot popups, the route (rows reordered to follow the trip) and custom 20x20 markers
leaflet(data=places) %>%
    addTiles() %>%
    setView(90, 15, zoom=4) %>%
    addPopups(lng=~Longitude, lat=~Latitude, popup=~Place) %>%
    addPolylines(~Longitude, ~Latitude, data=places[c(1, 3, 2:9, 1), ]) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=makeIcon(iconUrl='https://www.jaredlander.com/images/jaredlanderfavicon.png', iconWidth=20, iconHeight=20))
Calculating all the miles traveled could be as simple as looking it up on TripIt, or we could do some quick Haversine distance calculations with the geosphere package.
First, we get the coordinates for New York, Minneapolis and Boston to have a complete picture of the distance.
newCities <- adply(c('New York, NY', 'Minneapolis, MN', 'Boston, MA'), 1, lat.long)
allPlaces <- rbind(newCities[c(1, 2, 1), ], places[c(1, 3, 2:9, 1), ], newCities[c(1, 3, 1), ])
Then, in order to use distHaversine, we need to set up a to-and-from relationship between the places. The easiest way will be to just shift the columns.
library(useful)
## Loading required package: ggplot2
shiftedPlaces <- shift.column(data=allPlaces, columns=names(places)[-1], newNames=c('To', 'Lat2', 'Long2'))
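As a rough illustration of what the shift accomplishes, here is a base R sketch that pairs each row with the row after it. The three-row `route` data.frame and the `legs` name are made up for this example (the coordinates are the rounded values from the table above), and this is not how `shift.column` is implemented, only the same idea.

```r
# A short stretch of the route, using rounded coordinates from the table
route <- data.frame(
    Place=c('Singapore', 'Kuala Lumpur, Malaysia', 'George Town, Malaysia'),
    Latitude=c(1.352, 3.139, 5.415),
    Longitude=c(103.820, 101.687, 100.330),
    stringsAsFactors=FALSE
)

# Pair row i (origin) with row i + 1 (destination).
# The last row has no destination, so the result has one fewer row.
n <- nrow(route)
from <- route[-n, ]
to <- route[-1, ]
names(to) <- c('To', 'Lat2', 'Long2')

legs <- cbind(from, to)
rownames(legs) <- NULL
legs
```

Each row of `legs` now describes one leg of the trip, which is the shape a pairwise distance function needs.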
Now we can calculate the distance. This assumes that all trips followed a great circle, which might not be the case, especially for the car and rail portions of the trip.
library(geosphere)
## Loading required package: sp
shiftedPlaces$Distance <- distHaversine(shiftedPlaces[, c("Longitude", "Latitude")], shiftedPlaces[, c("Long2", "Lat2")], r=3959)
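For readers curious about what the haversine calculation does, the formula can be written in a few lines of base R. This is a minimal sketch under the same spherical-Earth assumption, with a radius of 3959 miles; the `haversine` function name is my own for this example and it is not the geosphere source code.

```r
# Haversine great-circle distance between two points given in degrees.
# r is the radius of the sphere; 3959 is roughly the Earth's radius in miles.
haversine <- function(lon1, lat1, lon2, lat2, r=3959)
{
    toRad <- pi / 180
    dLat <- (lat2 - lat1) * toRad
    dLon <- (lon2 - lon1) * toRad
    a <- sin(dLat / 2)^2 +
        cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
    # pmin guards against rounding pushing sqrt(a) slightly above 1
    2 * r * asin(pmin(1, sqrt(a)))
}

# New York to Boston, using the rounded coordinates for those cities
nyToBoston <- haversine(-74.006, 40.713, -71.059, 42.360)
round(nyToBoston)
```

With those rounded inputs the result should land at roughly 190 miles for the New York to Boston leg.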
In total this led to 25,727 miles traveled.
knitr::kable(shiftedPlaces[, -1], digits=c(1, 3, 3, 1, 3, 3, 0), row.names=FALSE)
Place | Latitude | Longitude | To | Lat2 | Long2 | Distance |
---|---|---|---|---|---|---|
New York, NY | 40.713 | -74.006 | Minneapolis, MN | 44.978 | -93.265 | 1016 |
Minneapolis, MN | 44.978 | -93.265 | New York, NY | 40.713 | -74.006 | 1016 |
New York, NY | 40.713 | -74.006 | Hong Kong | 22.396 | 114.109 | 8046 |
Hong Kong | 22.396 | 114.109 | Kolkata, India | 22.573 | 88.364 | 1642 |
Kolkata, India | 22.573 | 88.364 | Haripal, India | 22.817 | 88.105 | 24 |
Haripal, India | 22.817 | 88.105 | Kolkata, India | 22.573 | 88.364 | 24 |
Kolkata, India | 22.573 | 88.364 | Jaipur, India | 26.912 | 75.787 | 844 |
Jaipur, India | 26.912 | 75.787 | Agra, India | 27.177 | 78.008 | 138 |
Agra, India | 27.177 | 78.008 | Delhi, India | 28.614 | 77.209 | 111 |
Delhi, India | 28.614 | 77.209 | Singapore | 1.352 | 103.820 | 2574 |
Singapore | 1.352 | 103.820 | Kuala Lumpur, Malaysia | 3.139 | 101.687 | 192 |
Kuala Lumpur, Malaysia | 3.139 | 101.687 | George Town, Malaysia | 5.415 | 100.330 | 183 |
George Town, Malaysia | 5.415 | 100.330 | Hong Kong | 22.396 | 114.109 | 1491 |
Hong Kong | 22.396 | 114.109 | New York, NY | 40.713 | -74.006 | 8046 |
New York, NY | 40.713 | -74.006 | Boston, MA | 42.360 | -71.059 | 190 |
Boston, MA | 42.360 | -71.059 | New York, NY | 40.713 | -74.006 | 190 |
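The total quoted earlier can be double-checked by adding up the Distance column. The values below are copied by hand from the table above, in row order, so this is only an arithmetic check and not a rerun of the distance calculation; `legMiles` and `totalMiles` are names chosen for this example.

```r
# Distance column (miles) copied from the table, one entry per leg
legMiles <- c(1016, 1016, 8046, 1642, 24, 24, 844, 138,
              111, 2574, 192, 183, 1491, 8046, 190, 190)

# Total miles across all legs
totalMiles <- sum(legMiles)
totalMiles
# 25727
```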
# Map the full itinerary, including the US legs, with the same custom markers
leaflet(data=allPlaces) %>%
    addTiles() %>%
    setView(80, 20, zoom=3) %>%
    addPolylines(~Longitude, ~Latitude) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=makeIcon(iconUrl='https://www.jaredlander.com/images/jaredlanderfavicon.png', iconWidth=20, iconHeight=20))
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences, and author of R for Everyone.