We are so excited to bring back the fourth annual Government and Public Sector R Conference to a computer screen near you! Join us on December 9th and 10th for a jam-packed two-day conference filled with an incredible lineup of speakers across the government and public sector space.

Buy your tickets here!

Over the two day, you will hear talks from, to name them ALL; Alex Gold (RStudio), Wendy Martinez (Bureau of Labor Statistics), Marck Vaisman (Microsoft), Brooke Frye (NYC Council), Asmae Toumi (PursueCare), Tommy Jones (In-Q-Tel), Jordan Jasuta Fischer (IBM), Aaron Mannes (Department of Homeland Security), Benjamin Braun (202 Group), Mayari Montes de Oca (formerly NYU Global TIES for Children), Cezary Podkul (ProPublica), Sydney Coston (US Military Academy), Abhijit Dasgupta (Georgetown University), Vivian Peng (City of LA), Boriana Pratt (Princeton University), Madhava Jay (OpenMined), Lauren Lombardo (Harvard University), Jasmine Han (Bloomberg Industry Group), David Shor (Blue Rose Research), Jorge Luna (Aetna), Coline Zeballos (Roche Pharma), Surabhi Hodigere (Harvard University)…and myself, Jared Lander (Lander Analytics).

We even have a full day of workshops held the day before the conference (on Wednesday December 8th), also virtually online!

Understanding Climate Change Data with Alexandra Boghosian
Introduction to Machine Learning for Public Policy with Jonathan Hersh

Alexandra, one of my former students, plans to take you through data analysis on climate change, by exploring publicly available datasets that you may or may not have known about. As Alexandra describes it, “by the end of the day, attendees should have a working knowledge of how the data support the science, and where to gather data and information about specific climate change issues they may face in their work.”

If a workshop on climate change is not right for you, my close friend Jonathan Hersh is giving his workshop on machine learning for public policy. Jonathan will lead you through applied machine learning exercises with a focus on the public space. The session will cover basic concepts like supervised vs. unsupervised learning, testing and training sets, and the bias-variance tradeoff. Jonathan will also review linear regression, ridge regression, cross-validation, and lasso regression. He will cover R Language and syntax, data manipulation in R, exploratory data analysis, and basic plotting in R.

The aRt auction is back! The auction features pieces by artists from the R Community. All proceeds will be donated to the R Foundation, a not-for-profit organization committed to the continued development of R. Bidding will begin December 1st and winners will be announced December 10th in the afternoon during the conference!

I would like to extend a special thank you to my good friend, Marck Vaisman and the Data Community DC. This conference would not be possible without your continued support. And I thank you, Marck, for our close friendship over the years. It has been incredible to witness the growth of the meetup community right alongside you.

I would also like to thank the Georgetown Master’s program in data science & analytics for sponsoring our virtual event, and for hosting us at previous R Gov events.

The additional support of RStudio, PolicyViz, the R Consortium and Pearson goes a long way to making this conference such an informative and fun event.

To learn more about our speaker lineup, workshops, agenda and sponsors visit rstats.ai and landeranalytics.com. Also, don’t forget to follow us on Twitter at @rstatsai and @landeranalytics. And finally, if you are interested in being a sponsor, please contact us!

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science and AI firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Government Data Science and AI Conferences and author of R for Everyone.

In previous posts I talked about collecting temperature data from different rooms in my house and automating that process using Docker and GitHub Actions. Now it is time to analyze that data to figure out what is happening in the house to make informed decisions about replacing the HVAC units and how I can make the house more comfortable for everyone until then.

During this process I pull the data into R from a DigitalOcean space (like an AWS S3 bucket) using {arrow}, manipulate the data with {dplyr}, {tidyr} and {tsibble}, visualize the data with {timetk}, fit models with {fable} and visualize the models with {coefplot}.

Getting the Data

As discussed in the post about tracking the data, we created a CSV for each day of data in a DigitalOcean space. The CSV tracks room temperatures, thermostat settings, outdoor temperature, etc for every five-minute interval during each day.

In the past, if multiple files had to be read into a single data.frame or tibble the best course of action would have been to combine map_df() from {purrr} with read_csv() or fread() from {readr} and {data.table}, respectively. But as seen in Wes McKinney and Neal Richardson’s talk from the 2020 New York R Conference, using open_data() from {arrow} then collecting the data with dplyr::collect() should be faster. An added bonus is that {arrow} has functionality built in to pull from an S3 bucket, and that includes DigitalOcean spaces.

The first step is to use arrow::S3FileSystem$create() to make a reference to the S3 file system. Several pieces of information are needed for this:

access_key: The DigitalOcean Access Key created at https://cloud.digitalocean.com/account/api/tokens.
secret_key: The DigitalOcean Secret Key created at https://cloud.digitalocean.com/account/api/tokens.
scheme: This should just be "https".
endpoint_override: Since we are using DigitalOcean and not AWS we need to change the default to something along the lines of "nyc3.digitaloceanspaces.com", depending on the region.

The Access Key can be retrieved at any time, but the Secret Key is only displayed one time, so we must save it somewhere.

This is all the same information used when writing the files to the DigitalOcean space in the first place. We also save it all in environment variables in the .Renviron file to avoid exposing this sensitive information in our code. It is very important to not check this file into git to reduce risk of exposing the private keys.

space <- arrow::S3FileSystem$create(
  access_key=Sys.getenv('AWS_ACCESS_KEY_ID'),
  secret_key=Sys.getenv('AWS_SECRET_ACCESS_KEY'),
  scheme="https",
  endpoint_override=glue::glue("{Sys.getenv('DO_REGION')}.{Sys.getenv('DO_BASE')}")
)
space

## S3FileSystem

The space object represents the S3 file system we have access to. Its elements are mostly file system type functions such as cd() ls(), CopyFile() and GetFileInfo() and are accessed via space$function_name(). To see a listing of files in a folder inside a bucket we call space$ls() with the bucket and folder name as the argument, in quotes.

space$ls('bucket_name/folder_name')

##  [1] "bucket_name/folder_name/2021-01-01.csv"
##  [2] "bucket_name/folder_name/2021-01-02.csv"
##  [3] "bucket_name/folder_name/2021-01-03.csv"
##  [4] "bucket_name/folder_name/2021-01-04.csv"
##  [5] "bucket_name/folder_name/2021-01-05.csv"
##  [6] "bucket_name/folder_name/2021-01-06.csv"
##  [7] "bucket_name/folder_name/2021-01-07.csv"
##  [8] "bucket_name/folder_name/2021-01-08.csv"
##  [9] "bucket_name/folder_name/2021-01-09.csv"
## [10] "bucket_name/folder_name/2021-01-10.csv"
## [11] "bucket_name/folder_name/2021-01-11.csv"
## [12] "bucket_name/folder_name/2021-01-12.csv"
## [13] "bucket_name/folder_name/2021-01-13.csv"
## [14] "bucket_name/folder_name/2021-01-14.csv"
## [15] "bucket_name/folder_name/2021-01-15.csv"
## [16] "bucket_name/folder_name/2021-01-16.csv"
## [17] "bucket_name/folder_name/2021-01-17.csv"
## [18] "bucket_name/folder_name/2021-01-18.csv"
## [19] "bucket_name/folder_name/2021-01-19.csv"
## [20] "bucket_name/folder_name/2021-01-20.csv"
## [21] "bucket_name/folder_name/2021-01-21.csv"
## [22] "bucket_name/folder_name/2021-01-22.csv"
## [23] "bucket_name/folder_name/2021-01-23.csv"
## [24] "bucket_name/folder_name/2021-01-24.csv"
## [25] "bucket_name/folder_name/2021-01-25.csv"
## [26] "bucket_name/folder_name/2021-01-26.csv"
## [27] "bucket_name/folder_name/2021-01-27.csv"
## [28] "bucket_name/folder_name/2021-01-28.csv"
## [29] "bucket_name/folder_name/2021-01-29.csv"
## [30] "bucket_name/folder_name/2021-01-30.csv"
## [31] "bucket_name/folder_name/2021-01-31.csv"
## [32] "bucket_name/folder_name/2021-02-01.csv"
## [33] "bucket_name/folder_name/2021-02-02.csv"
## [34] "bucket_name/folder_name/2021-02-03.csv"
## [35] "bucket_name/folder_name/2021-02-04.csv"
## [36] "bucket_name/folder_name/2021-02-05.csv"
## [37] "bucket_name/folder_name/2021-02-06.csv"
## [38] "bucket_name/folder_name/2021-02-07.csv"
## [39] "bucket_name/folder_name/2021-02-08.csv"
## [40] "bucket_name/folder_name/2021-02-09.csv"
## [41] "bucket_name/folder_name/2021-02-10.csv"
## [42] "bucket_name/folder_name/2021-02-11.csv"
## [43] "bucket_name/folder_name/2021-02-12.csv"
## [44] "bucket_name/folder_name/2021-02-13.csv"
## [45] "bucket_name/folder_name/2021-02-14.csv"
## [46] "bucket_name/folder_name/2021-02-15.csv"
## [47] "bucket_name/folder_name/2021-02-16.csv"
## [48] "bucket_name/folder_name/2021-02-17.csv"
## [49] "bucket_name/folder_name/2021-02-18.csv"
## [50] "bucket_name/folder_name/2021-02-19.csv"
## [51] "bucket_name/folder_name/2021-02-20.csv"
## [52] "bucket_name/folder_name/2021-02-21.csv"
## [53] "bucket_name/folder_name/2021-02-22.csv"
## [54] "bucket_name/folder_name/2021-02-23.csv"

To use open_dataset() we need a path to the folder holding the files. This is built with space$path(). The structure is "bucket_name/folder_name". For security, even the bucket and folder names are stored in environment variables. For this example, the real values have been substituted with "bucket_name" and "folder_name".

files_path <- space$path(glue::glue("{Sys.getenv('DO_SPACE')}/{Sys.getenv('DO_FOLDER')}"))

files_path

## SubTreeFileSystem: s3://bucket_name/folder_name/

Now we can call open_dataset() to get access to all the files in that folder. We need to specify format="csv" to let the function know the files are CSVs as opposed to parquet or other formats.

files_path <- space$path(glue::glue("{Sys.getenv('DO_SPACE')}/{Sys.getenv('DO_FOLDER')}"))


temps_dataset <- arrow::open_dataset(files_path, format='csv')

## FileSystemDataset with 66 csv files
## Name: string
## Sensor: string
## date: date32[day]
## time: string
## temperature: double
## occupancy: int64
## Thermostat: int64
## zoneAveTemp: double
## HVACmode: string
## fan: int64
## outdoorTemp: double
## outdoorHumidity: int64
## sky: int64
## wind: int64
## zoneClimate: string
## zoneCoolTemp: double
## zoneHeatTemp: double
## zoneHVACmode: string
## zoneOccupancy: int64

Printing out the temps_dataset object shows the column names along with the type of data stored in each column. {arrow} is pretty great and lets us do column selection and row filtering on the data sitting in files which opens up a whole world of data analysis on data too large to fit in memory. We are simply going to select columns of interest and collect all the data into one data.frame (actually a tibble). Notice we call select() before collect() because this reduces the number of columns being transmitted over the network.

library(dplyr)

temps_raw %>% head()

## # A tibble: 6 x 16
##   Name  Sensor date       time  temperature   fan zoneAveTemp HVACmode
##   <chr> <chr>  <date>     <chr>       <dbl> <int>       <dbl> <chr>   
## 1 Upst… Bedro… 2021-01-01 00:0…        67.8   300          70 heat    
## 2 Upst… Bedro… 2021-01-01 00:0…        72.4   300          70 heat    
## 3 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat    
## 4 Upst… Upsta… 2021-01-01 00:0…        63.1   300          70 heat    
## 5 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat    
## 6 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat    
## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>,
## #   wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>,
## #   zoneHVACmode <chr>

Preparing the Data

The data are in need of some minor attention. First, the time column is read in as a character instead of time. This is a known issue, so instead we combine it with the date column using tidyr::unite() then convert this combination into a datetime (or timestamp) with lubridate::as_datetime(), which requires us to set a time zone. Then, the sensors named Upstairs Thermostat and Downstairs Thermostat actually represent Bedroom 3 and Dining Room, respectively, so we rename those using case_when() from {dplyr}, leaving all the other sensor names as is. Then we only keep rows where the time is earlier than the present. This is due to a quirk of the Ecobee API where it can return future readings, which happens because we request all of the latest day’s data.

temps <- temps_raw %>%
  tidyr::unite(col='time', date, time, sep=' ', remove=TRUE) %>%
  mutate(time=lubridate::as_datetime(time, tz=Sys.timezone())) %>%
  mutate(
    Sensor=case_when(
      Sensor == 'Upstairs Thermostat' ~ 'Bedroom 3',
      Sensor == 'Downstairs Thermostat' ~ 'Dining Room',
      TRUE ~ Sensor
    )
  ) %>%
  filter(time <= lubridate::now())

Skip All of This

Since the actual data may contain sensitive information a sanitized version is stored as parquet files on a public DigitalOcean space. This dataset is not as up to date but it will do for those that want to follow along.

publicspace <- arrow::S3FileSystem$create(
  # anonymous means we do not need to provide credentials
  # is crucial, otherwise the function looks for credentials
  anonymous=TRUE,
  # and crashes R if it can't find them
  scheme="https",
  # the data are stored in the nyc3 region
  endpoint_override='nyc3.digitaloceanspaces.com'
)
publicspace$ls('temperaturedata', recursive=TRUE)

## [1] "temperaturedata/2021"                 
## [2] "temperaturedata/2021/1"               
## [3] "temperaturedata/2021/1/part-0.parquet"
## [4] "temperaturedata/2021/2"               
## [5] "temperaturedata/2021/2/part-1.parquet"

publictemp <- arrow::open_dataset(publicspace$path("temperaturedata")) %>%
    collect()


publictemp %>% head()

## # A tibble: 6 x 15
##   Name  Sensor time                temperature   fan zoneAveTemp HVACmode
##   <chr> <chr>  <dttm>                    <dbl> <int>       <dbl> <chr>   
## 1 Upst… Bedro… 2021-01-01 00:00:00        67.8   300          70 heat    
## 2 Upst… Bedro… 2021-01-01 00:00:00        72.4   300          70 heat    
## 3 Upst… Offic… 2021-01-01 00:00:00        NA     300          70 heat    
## 4 Upst… Bedro… 2021-01-01 00:00:00        63.1   300          70 heat    
## 5 Upst… Offic… 2021-01-01 00:00:00        NA     300          70 heat    
## 6 Upst… Offic… 2021-01-01 00:00:00        NA     300          70 heat    
## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>,
## #   wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>,
## #   zoneHVACmode <chr>

Visualizing with Interactive Plots

The first step in any good analysis is visualizing the data. In the past, I would have recommended {dygraphs} but have since come around to {feasts} from Rob Hyndman’s team and {timetk} from Matt Dancho due to their ease of use, better flexibility with data types and because I personally think they are more visually pleasing.

In order to show the outdoor temperature on the same footing as the sensor temperatures, we select the pertinent columns with the outdoor data from the temps tibble and stack them at the bottom of temps, after accounting for duplicates (Each sensor has an outdoor reading, which is the same across sensors).

all_temps <- dplyr::bind_rows(
  temps,
  temps %>% 
    dplyr::select(Name, Sensor, time, temperature=outdoorTemp) %>% 
    dplyr::mutate(Sensor='Outdoors', Name='Outdoors') %>% 
    dplyr::distinct()
)

Then we use plot_time_series() from {timetk} to make a plot that has a different colored line for each sensor and for the outdoors (with sensors being grouped into facets according to which thermostat they are associated with). Ordinarily the interactive version would be better (.interactive=TRUE below), but for the sake of this post static displays better.

library(timetk)
all_temps %>%
  group_by(Name) %>%
  plot_time_series(
    .date_var=time, 
    .value=temperature, .color_var=Sensor, .smooth=FALSE,
    .interactive=FALSE
  )

## Warning: Removed 34776 row(s) containing missing values (geom_path).

From this we can see a few interesting patterns. Bedroom 1 is consistently higher than the other bedrooms, but lower than the offices which are all on the same thermostat. All the downstairs rooms see their temperatures drop overnight when the thermostat is lowered to conserve energy. Between February 12 and 18 this dip was not as dramatic, due to running an experiment to see how raising the temperature on the downstairs thermostat would affect rooms on the second floor.

Computing Cross Correlations

One room in particular, Bedroom 2, always seemed colder than all the rest of the rooms, so I wanted to figure out why. Was it affected by physically adjacent rooms? Or maybe it was controlled by the downstairs thermostat rather than the upstairs thermostat as presumed.

So the first step was to see which rooms were correlated with each other. But computing correlation with time series data isn’t as simple because we need to account for lagged correlations. This is done with the cross-correlation function (ccf).

In order to compute the ccf, we need to get the data into a time series format. While R has extensive formats, most notably the built-in ts and its extension xts by Jeff Ryan, we are going to use tsibble, from Rob Hyndman and team, which are tibbles with a time component.

The tsibble object can treat multiple time series as individual data and this requires the data to be in long format. For this analysis, we want to treat multiple time series as interconnected, so we need to put the data into wide format using pivot_wider().

wide_temps <- all_temps %>% 
  # we only need these columns
  select(Sensor, time, temperature) %>% 
  # make each time series its own column
  tidyr::pivot_wider(
    id_cols=time, 
    names_from=Sensor, 
    values_from=temperature
  ) %>% 
  # convert into a tsibble
  tsibble::as_tsibble(index=time) %>% 
  # fill down any NAs
  tidyr::fill()


wide_temps

## # A tibble: 15,840 x 12
##    time                `Bedroom 2` `Bedroom 1` `Office 1` `Bedroom 3` `Office 3`
##    <dttm>                    <dbl>       <dbl>      <dbl>       <dbl>      <dbl>
##  1 2021-01-01 00:00:00        67.8        72.4         NA        63.1         NA
##  2 2021-01-01 00:05:00        67.8        72.8         NA        62.7         NA
##  3 2021-01-01 00:10:00        67.9        73.3         NA        62.2         NA
##  4 2021-01-01 00:15:00        67.8        73.6         NA        62.2         NA
##  5 2021-01-01 00:20:00        67.7        73.2         NA        62.3         NA
##  6 2021-01-01 00:25:00        67.6        72.7         NA        62.5         NA
##  7 2021-01-01 00:30:00        67.6        72.3         NA        62.7         NA
##  8 2021-01-01 00:35:00        67.6        72.5         NA        62.5         NA
##  9 2021-01-01 00:40:00        67.6        72.7         NA        62.4         NA
## 10 2021-01-01 00:45:00        67.6        72.7         NA        61.9         NA
## # … with 15,830 more rows, and 6 more variables: `Office 2` <dbl>, `Living
## #   Room` <dbl>, Playroom <dbl>, Kitchen <dbl>, `Dining Room` <dbl>,
## #   Outdoors <dbl>

Some rooms’ sensors were added after the others so there are a number of NAs at the beginning of the tracked time.

We compute the ccf using CCF() from the {feasts} package then generate a plot by piping that result into autoplot(). Besides a tsibble, CCF() needs two arguments, the columns whose cross-correlation is being computed.

library(feasts)
wide_temps %>%
  CCF(`Living Room`, `Bedroom 2`) %>%
  autoplot()

Cross-correlation between Living Room and Bedroom 2

The negative lags are when the first column is a leading indicator of the second column and the positive lags are when the first column is a lagging indicator of the second column. In this case, Living Room seems to follow Bedroom 2 rather than lead.

We want to compute the ccf between Bedroom 2 and every other room. We call CCF() on each column in wide_temps against Bedroom 2, including Bedroom 2 itself, by iterating over the columns of wide_temps with map(). We use a ~ to treat CCF() as a quoted function and wrap .x in double curly braces to take advantage of non-standard evaluation. This creates a list of objects that can be plotted with autoplot(). We make a list of ggplot objects by iterating over the ccf objects using imap(). This allows us to use the name of each element of the list in ggtitle. Lastly, wrap_plots() from {patchwork} is used to show all the plots in one display.

library(patchwork)
library(ggplot2)


room_ccf <- wide_temps %>% 
  purrr::map(~CCF(wide_temps, {{.x}}, `Bedroom 2`))


# we don't want the CCF with time, outdoor temp or the Bedroom 2 itself
room_ccf$time <- NULL
room_ccf$Outdoors <- NULL
room_ccf$`Bedroom 2` <- NULL


biggest_cor <- room_ccf %>% purrr::map_dbl(~max(.x$ccf)) %>% max()


room_ccf_plots <- purrr::imap(
  room_ccf, 
  ~ autoplot(.x) +
    # show the maximally cross-correlation value
    # so it easier to compare rooms
    geom_hline(yintercept=biggest_cor, color='red', linetype=2) + 
    ggtitle(.y) + 
    labs(x=NULL, y=NULL) + 
    ggthemes::theme_few()
)


wrap_plots(room_ccf_plots, ncol=3)

Cross-correlation between each room and Bedroom 2

This brings up an odd result. Office 1, Office 2, Office 3, Bedroom 1, Bedroom 2 and Bedroom 3 are all controlled by the same thermostat, but Bedroom 2 is only really cross-correlated with Bedroom 3 (and negatively with Office 3). Instead, Bedroom 2 is most correlated with Playroom and Living Room, both of which are controlled by a different thermostat.

So it will be interesting to see which thermostat setting (the set desired temperature) is most cross-correlated with Bedroom 2. At the same time we’ll check how much the outdoor temperature cross-correlates with Bedroom 2 as well.

temp_settings <- all_temps %>% 
  # there is a column for outdoor temperatures so we don't need these rows
  filter(Name != 'Outdoors') %>% 
  select(Name, time, HVACmode, zoneCoolTemp, zoneHeatTemp) %>% 
  # this information is repeated for each sensor, so we only need it once
  distinct() %>% 
  # depending on the setting, the set temperature is in two columns
  mutate(setTemp=if_else(HVACmode == 'heat', zoneHeatTemp, zoneCoolTemp)) %>% 
  tidyr::pivot_wider(id_cols=time, names_from=Name, values_from=setTemp) %>% 
  # get the reading data
  right_join(wide_temps %>% select(time, `Bedroom 2`, Outdoors), by='time') %>% 
  as_tsibble(index=time)


temp_settings

## # A tibble: 15,840 x 5
##    time                Upstairs Downstairs `Bedroom 2` Outdoors
##    <dttm>                 <dbl>      <dbl>       <dbl>    <dbl>
##  1 2021-01-01 00:00:00       71         72        67.8     31.3
##  2 2021-01-01 00:05:00       71         72        67.8     31.3
##  3 2021-01-01 00:10:00       71         72        67.9     31.3
##  4 2021-01-01 00:15:00       71         66        67.8     31.3
##  5 2021-01-01 00:20:00       71         66        67.7     31.3
##  6 2021-01-01 00:25:00       71         66        67.6     31.3
##  7 2021-01-01 00:30:00       71         66        67.6     31.2
##  8 2021-01-01 00:35:00       71         66        67.6     31.2
##  9 2021-01-01 00:40:00       71         66        67.6     31.2
## 10 2021-01-01 00:45:00       71         66        67.6     31.2
## # … with 15,830 more rows

Making this tsibble longer allows us to use the built-in grouping and iteration that comes with CCF() and other functions in {fable}.

temp_settings_long <- temp_settings %>% 
  tidyr::pivot_longer(
    cols=-c(time, `Bedroom 2`), 
    names_to='Thermostat', values_to='Temperature'
  )

room_thermostat_ccf <- temp_settings_long %>% 
  CCF(Temperature, `Bedroom 2`)

room_thermostat_ccf %>% autoplot() + ggthemes::theme_few()

Cross-correlation between Bedroom 2 and thermostat settings and the outdoor temperature

This plot makes a strong argument for Bedroom 2 being more affected by the downstairs thermostat than the upstairs like we presume. But I don’t think the downstairs thermostat is actually controlling the vents in Bedroom 2 because I have looked at the duct work and followed the path (along with a professional HVAC technician) from the furnace to the vent in Bedroom 2. What I think is more likely is that the rooms downstairs get so cold (I lower the temperature overnight to conserve energy), and there is not much insulation between the floors, so the vent in Bedroom 2 can’t pump enough warm air to compensate.

I did try to experiment for a few nights (February 12 and 18) by not lowering the downstairs temperature but the eyeball test didn’t reveal any impact. Perhaps a proper A/B test is called for.

Fitting Time Series Models with `{fable}`

Going beyond cross-correlations, an ARIMA model with exogenous variables can give an indication if input variables have a significant impact on a time series while accounting for autocorrelation and seasonality. We want to model the temperature in Bedroom 2 while using all the other rooms, the outside temperature and the thermostat settings to see which affected Bedroom 2 the most. We fit a number of different models then compare them using AICc (AIC corrected) to see which fits the best.

To do this we need a tsibble with each of these variables as a column. This requires us to join temp_settings with wide_temps.

all_ts <- wide_temps %>% 
  left_join(
    temp_settings %>% select(time, Upstairs, Downstairs),
    by='time'
  )

Now we use model() from {fable} (technically from {fabletools}) to fit a number of models. We make it run in parallel with plan() from {future}.

library(future)
plan(multisession, workers=10)

library(tictoc)

library(fable)

tic()
ts_mods <- all_ts %>% 
  model(
    mod0=ARIMA(`Bedroom 2`)
    , mod1=ARIMA(`Bedroom 2` ~ `Bedroom 1`)
    , mod2=ARIMA(`Bedroom 2` ~ `Bedroom 3`)
    , mod3=ARIMA(`Bedroom 2` ~ `Bedroom 1` + `Bedroom 3`)
    , mod4=ARIMA(`Bedroom 2` ~ `Living Room`)
    , mod5=ARIMA(`Bedroom 2` ~ Playroom)
    , mod6=ARIMA(`Bedroom 2` ~ `Living Room` + Playroom)
    , mod7=ARIMA(`Bedroom 2` ~ `Living Room` + Playroom + 
                   Kitchen + `Dining Room`
    )
    , mod8=ARIMA(`Bedroom 2` ~ `Bedroom 1` + `Bedroom 3` + 
                   `Living Room` + Playroom + 
                   Kitchen + `Dining Room`
    )
    , mod9=ARIMA(`Bedroom 2` ~ Outdoors)
    , mod10=ARIMA(`Bedroom 2` ~ Upstairs)
    , mod11=ARIMA(`Bedroom 2` ~ Downstairs)
    , mod12=ARIMA(`Bedroom 2` ~ Downstairs + Upstairs)
    , mod13=ARIMA(`Bedroom 2` ~ Outdoors + Downstairs + Upstairs)
    , mod14=ARIMA(`Bedroom 2` ~ Outdoors + 
                    Downstairs + Upstairs + 
                    `Living Room` + Playroom + 
                    Kitchen + `Dining Room`
    )
    , mod15=ARIMA(`Bedroom 2` ~ Outdoors + 
                    Downstairs + Upstairs + 
                    `Bedroom 1` + `Bedroom 3`
    )
    , mod16=ARIMA(`Bedroom 2` ~ Outdoors + 
                    Downstairs + Upstairs + 
                    `Bedroom 1` + `Bedroom 3` +
                    `Living Room` + Playroom + 
                    Kitchen + `Dining Room`
    )
  )
toc()

## 99.41 sec elapsed

This results in a tibble of models (termed a mable) with one column per model. We use glance() to see the AICc for each, along with other information.

ts_mods %>% 
  glance() %>% 
  arrange(AICc)

## # A tibble: 17 x 8
##    .model sigma2 log_lik     AIC    AICc     BIC ar_roots  ma_roots  
##    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <list>    <list>    
##  1 mod3   0.0247   6723. -13427. -13427. -13358. <cpl [3]> <cpl [14]>
##  2 mod1   0.0247   6714. -13411. -13411. -13350. <cpl [3]> <cpl [14]>
##  3 mod15  0.0247   6709. -13394. -13394. -13302. <cpl [3]> <cpl [3]> 
##  4 mod4   0.0257   6393. -12768. -12768. -12699. <cpl [3]> <cpl [25]>
##  5 mod2   0.0264   6199. -12381. -12381. -12320. <cpl [2]> <cpl [15]>
##  6 mod9   0.0264   6190. -12365. -12365. -12304. <cpl [2]> <cpl [26]>
##  7 mod10  0.0264   6190. -12365. -12365. -12303. <cpl [2]> <cpl [26]>
##  8 mod0   0.0264   6189. -12365. -12365. -12311. <cpl [2]> <cpl [26]>
##  9 mod13  0.0264   6176. -12331. -12331. -12255. <cpl [2]> <cpl [26]>
## 10 mod12  0.0264   6175. -12331. -12331. -12262. <cpl [2]> <cpl [26]>
## 11 mod11  0.0264   6174. -12331. -12331. -12270. <cpl [2]> <cpl [26]>
## 12 mod6   0.0195   4940.  -9863.  -9863.  -9802. <cpl [5]> <cpl [0]> 
## 13 mod5   0.0195   4932.  -9849.  -9849.  -9796. <cpl [5]> <cpl [0]> 
## 14 mod8   0.0163   4603.  -9184.  -9184.  -9100. <cpl [4]> <cpl [0]> 
## 15 mod16  0.0163   4605.  -9181.  -9181.  -9074. <cpl [4]> <cpl [0]> 
## 16 mod7   0.0171   4364.  -8708.  -8708.  -8631. <cpl [5]> <cpl [0]> 
## 17 mod14  0.0171   4365.  -8704.  -8704.  -8605. <cpl [5]> <cpl [0]>

ts_mods %>% 
  glance() %>% 
  # sort .model according to AICc for better plotting
  mutate(.model=forcats::fct_reorder(.model, .x=AICc, .fun=sort)) %>% 
  # smaller is better for AICc
  # so negate it so that the tallest bar is best
  ggplot(aes(x=.model, y=-AICc)) + 
  geom_col() + 
  theme(axis.text.x=element_text(angle=25))

The best two models (as judged by AICc) are mod3 and mod1, some of the simpler models. The key thing they have in common is that Bedroom 1 is a predictor, which somewhat contradicts the findings from the cross-correlation function. We can examine mod3 by selecting it out of ts_mods then calling report().

ts_mods %>% 
  select(mod3) %>% 
  report()

## Series: Bedroom 2 
## Model: LM w/ ARIMA(3,1,2)(0,0,1)[12] errors 
## 
## Coefficients:
##          ar1      ar2     ar3     ma1     ma2     sma1  `Bedroom 1`
##       0.4497  -0.5184  0.0446  0.0650  0.3160  -0.0049       0.2723
## s.e.  0.0593   0.0471  0.0265  0.0589  0.0274   0.0082       0.0082
##       `Bedroom 3`
##            0.0396
## s.e.       0.0092
## 
## sigma^2 estimated as 0.02469:  log likelihood=6722.64
## AIC=-13427.28   AICc=-13427.27   BIC=-13358.25

This shows that Bedroom 1 and Bedroom 3 are significant predictors with SARIMA(3,1,2)(0,0,1) errors.

The third best model is one of the more complicated models, mod15. Not all the predictors are significant, though we should not judge a model, or the predictors, based on that.

ts_mods %>% 
  select(mod15) %>% 
  report()

## Series: Bedroom 2 
## Model: LM w/ ARIMA(3,1,3) errors 
## 
## Coefficients:
##          ar1      ar2      ar3     ma1     ma2     ma3  Outdoors  Downstairs
##       0.1870  -0.4426  -0.0724  0.3288  0.3745  0.0886   -0.0037     -0.0015
## s.e.  0.2351   0.0928   0.1058  0.2352  0.0490  0.0692    0.0026      0.0025
##       Upstairs  `Bedroom 1`  `Bedroom 3`
##         0.0087       0.2718       0.0387
## s.e.    0.0066       0.0081       0.0092
## 
## sigma^2 estimated as 0.02468:  log likelihood=6709.22
## AIC=-13394.43   AICc=-13394.41   BIC=-13302.39

We can visualize this model with {coefplot} after stripping out the time series coefficients.

library(coefplot)
ts_mods %>% 
    select(mod15) %>% 
    coef() %>% 
    filter(!stringr::str_detect(term, '^ar|ma\\d+$')) %>%
    select(Coefficient=term, Value=estimate, SE=std.error, Model=.model) %>%
    mutate(
        LowOuter=Value - 2*SE, LowInner=Value - 1*SE,
        HighInner=Value + 1*SE, HighOuter=Value + 2*SE
    ) %>%
    arrange(Value) %>%
    mutate(Coefficient=factor(Coefficient, levels=Coefficient)) %>%
    coefplot()

What all these top performing models have in common is the inclusion of the other bedrooms as predictors.

While helping me interpret the data, my wife pointed out that all the colder rooms are on one side of the house while the warmer rooms are on the other side. That colder side is the one that gets hit by winds which are rather strong. This made me remember a long conversation I had with a very friendly and knowledgeable sales person at GasTec who said that wind can significantly impact a house’s heat due to it blowing away the thermal envelope. I wanted to test this idea, but while the Ecobee API returns data on wind, the value is always 0, so it doesn’t tell me anything. Hopefully, I can get that fixed.

Running Everything with `{targets}`

There were many moving parts to the analysis, so I managed it all with {targets}. Each step seen in this post is a separate target in the workflow. There is even a target to build an html report from an R Markdown file. The key is to load objects created by the workflow into the R Markdown document with tar_read() or tar_load(). Then tar_render() will figure out how the R Markdown document figures into the computational graph.

tar_render(
  results,
  'Analysis.Rmd'
)

I went over {targets} more thoroughly in the previous post and I recommend Will Landau’s talk at the New York Open Statistical Programming Meetup to learn more.

Reporting with RStudio Connect

In addition to the R Markdown I made a shiny dashboard using the R Markdown flexdashboard format. I host this on RStudio Connect so I can get a live view of the data and examine the models. Besides from being a partner with RStudio, I really love this product and I use it for all of my internal reporting (and side projects) and we have a whole practice around getting companies up to speed with the tool.

What Comes Next?

So what can be done with this information? Based on the ARIMA model the other bedrooms on the same floor are predictive of Bedroom 2, which makes sense since they are controlled by the same thermostat. Bedroom 3 is consistently one of the coldest in the house and shares a wall with Bedroom 2, so maybe that wall could use some insulation. Bedroom 1 on the other hand is the hottest room on that floor, so my idea is to find a way to cool off that room (we sealed the vents with magnetic covers), thus lowering the average temperature on the floor, causing the heat to kick in, raising the temperature everywhere, including Bedroom 2.

The cross-correlation functions also suggest a relationship with Playroom and Living Room, which are some of the colder rooms downstairs and the former is directly underneath Bedroom 2. I’ve already put caulking cord in the gaps around the floorboards in the Playroom and that has noticeably raised the temperature in there, so hopefully that will impact Bedroom 2. We already have storm windows on just about every window and have added thermal curtains to Bedroom 2, Bedroom 3 and the Living Room. My wife and I also discovered air billowing in from around, not in, the fireplace in the Living Room, so if we can close that source of cold air, that room should get warmer then hopefully Bedroom 2 as well. Lastly, we want to insulate between the first floor ceiling and second floor.

While we make these fixes, we’ll keep collecting data and trying new ways to analyze it all. It will be interesting to see how the data change in the summer.

These three posts were intended to show the entire process from data collection to automation (what some people are calling robots) to analysis, and show how data can be used to make actionable decisions. This is one more way I’m improving my life with data.

The inaugural Government & Public Sector R Conference took place virtually from December 2nd to December 4th. With over 240 attendees, 26 speakers, three panelists and a rum masterclass class leader, the R|Gov conference was a place where data scientists could gather remotely to explore, share, and inspire ideas.

We had so many amazing speakers, whom we would like to thank: Lucy D’Agostino McGowan (Wake Forest University), Dr. Andrew Gelman (Columbia University), Dr. Graciela Chichilnisky (Global Thermostat), Dr. David Meza (NASA), Maj. Maxine Drake (US Army), Alex Gold (RStudio), Kimberly F. Sellers (Georgetown University; The U. S. Census Bureau), Dr. Tyler Morgan-Wall (Institute for Defense Analyses (IDA)), Imane El Idrissi & Dr. Anna Mantsoki (Foundation for Innovative New Diagnostics), Dr. Wendy Martinez (Bureau of Labor Statistics (BLS)), Col. Alfredo Corbett (US Air Force), Rose Martinez & Brooke Frye (New York City Council Data Team), Yvan Gauthier (Department of National Defence), Michael Jadoo (BLS), Tommy Jones (In-Q-Tel), Selina Carter (IDB), Refael Lav (Deloitte’s Federal Government Services teams), Dr. Abhijit Dasgupta (Zansors), Dr. Simina Boca (Georgetown University Medical Center), Dr. Wil Doane (IDA), Mo Johnson-León (Insight Lane), Dan Chen (Virginia Tech), Dr. Gwynn Sturdevant (HBS & R-Ladies DC), Marck Vaisman (Microsoft), Jonathan Hersh (Argyros School of Business), Kaz Sakamoto (Lander Analytics & Columbia University), Emily Martinez (NYC Department of Health and Mental Hygiene), Dan Whitenack (SIL International & Practical AI Podcast), Danya Murali (Arcadia Power), Malcolm Barrett (Teladoc Health) and myself.

All the talks will be shared on rstats.ai and the Lander Analytics YouTube channel in the very near future. Stay tuned!

Check out some of the highlights from the conference:

Graciela Chichilnisky explains how financial instruments can resolve climate change

One of my former professors at Columbia University, Dr. Graciela Chichilnisky, gave a presentation on how financial instruments can resolve climate change quickly and effectively by using existing capital markets to benefit high—and, especially, low—income groups. The process Dr. Chichilnisky proposes is simple and can lead to a transformation of our capitalistic economy in the direction of human survival. Furthermore, it is realistic and is profitable. Dr. Chichilnisky acted as the lead U.S. author on the Intergovernmental panel on Climate Change, which received the 2007 Nobel Prize for its work in deciding world policy with respect to climate change, and she worked extensively on the Kyoto Protocol, creating and designing the carbon market that became international law in 2005.

Another classic no-slides talk from Andrew Gelman on how his team and The Economist Magazine built a presidential election forecasting model

Another professor of mine, Andrew Gelman told us he wanted to give a talk on how his team’s election forecasting succeeded brilliantly, failed miserably, or landed somewhere in between. To build the model, they combined national polls, state polls, and political and economic fundamentals. Because we didn’t know the results of the election at the time, he didn’t know which of the three he’d be talking about… So how did his election forecast perform? The model predicted 49 out of 50 states correctly… But that doesn’t mean the forecast was perfect… For some background, see this article.

Wendy Martinez inspires and shares lessons about the rocky road she traveled to using R at a U.S. Government agency

Wendy Martinez described some of her experiences — both successes and failures — using R at several U.S. government agencies. In addition to serving as the Director of the Mathematical Statistics Research Center at the Bureau of Labor Statistics (BLS) for the last eight years, she is currently the President of the American Statistical Association (ASA), and she also served in several research positions throughout the Department of Defense. She has also written two books on MATLAB! It’s nice to see that she switched to open source.

Colonel Alfredo Corbett Spoke On Air Combat Command Enterprise Data Improvements

Deputy Director of Communications of the United States Air Force Colonel Alfedo Corbett showed us why, in his work, data can be a warfighting asset, fundamental to how Air Combat Command (ACC) operates in—and supports—all domains of warfare. In coordination with the Department of Defense and the Department of the Air Force, ACC is working to improve its data governance, data architecture, data standards, and data talent & culture, implementing major improvements to the way it manages, acquires, ingests, stores, processes, exploits, analyzes, and delivers data to its almost 100,000 operators.

We Participated in Two Virtual Happy Hours!

At lunch on the first day of the conference, we took a dive into the history and distillation process of a legendary rum made at the longest continuously running distillery in the world, Mount Gay Brand Ambassador Darrio Prescod shared his knowledge and transported us to Barbados (where he tuned in from virtually). Following the second day of the conference, members of the Mount Gay brand development team took us through a rum tasting and shook up a couple of cocktails. Attendees and speakers listened and hung out, drinking rum, matcha, soda or water during our virtual happy hour.

All proceeds from the A(R)T Auction went to the R Foundation

The A(R)T Auction was held in support of the R Foundation, featuring pieces by artists in the R Community. Artists included Nadieh Bremer (left), Selina Carter, Thomas Lin Pedersen (right), Will Chase, DiKayo Data.

Traditional R-Ladies Group Photo Happened Again

We took an R-Ladies group [virtual] selfie. We would like to note that more R-Ladies participated, but chose not to share video.:

Jon Harmon, Selina Carter, Mayarí Montes de Oca & DiKayo Data win Raspberry Pis, Noise Cancelling Headphones, and Gaming Mechanical Keyboards for Most Active Tweeting
You can see the R|Gov 2020 R Shiny Scoreboard here! A custom started at DCR 2018 by our Twitter scorekeeper Malorie Hughes (@data_all_day), has returned every year by popular demand. Congratulations to our winners!

52 Conference Attendees Participated in Pre-Conference Workshops

We ran the following workshops prior to the conference:

Causal Inference in R with Lucy D’Agostino McGowan and Malcolm Barret
Geospatial Statistics and Mapping in R with Kaz Sakamoto
Introduction to Machine Learning for Public Policy with Jonathan Hersh

Moving from DCR to R|Gov
With the shift to remote, we realized we could welcome a global audience to our annual conference, as we did for the virtual New York R Conference in August. And that gave birth to R|Gov, the Government and Public Sector R Conference. This new industry-focused conference focused on work in government, defense, NGOs and the public sector, and we have speakers from not only the DC-area, but also from Geneva, Switzerland, Nashville, Tennessee, Quebec, Canada and Los Angeles, California. For next year, we are working to invite speakers from more levels of government–local, state and federal. You can read more about this choice here.

Like NYR, R|Gov featured many in-person components of the gathering, like networking sessions, speaker walk-on songs and fun facts, happy hours, lots of giveaways, the Twitter contest, and the auction.

Thank you, Lander Analytics Team!

Even though it was virtual, there was a lot of work that went into the conference, and I want to thank my amazing team at Lander Analytics along with our producer, Bill Prickett, for making it all come together.

Looking Forward to New York, R|Gov, and Dublin!

If you attended, we hope you had an incredible experience. If you did not attend this year’s conference, we hope to see you at the at the New York R Conference and R|Gov in 2021, and, soon, the first Dublin R Conference.

The sixth annual (and first virtual) “New York” R Conference took place August 5-6 & 12-15. Almost 300 attendees, and 30 speakers, plus a stand-up comedian and a whiskey masterclass leader, gathered remotely to explore, share, and inspire ideas.

We had many awesome speakers, many new and some, returning: Dr. Rob J Hyndman (Monash University), Dr. Adam Obeng (Facebook), Ludmila Janda (Amplify), Emily Robinson (Warby Parker), Daniel Chen (Virginia Tech, Lander Analytics), Dr. Jon Krohn (untapt), Dr. Andrew Gelman (Columbia University), David Smith (Microsoft), Laura Gabrysiak (Visa), Brooke Watson (ACLU), Dr. Sebastian Teran Hidalgo (Vroom), Catherine Zhou (Codecademy), Dr. Jacqueline Nolis (Brightloom), Sonia Ang (Microsoft), Emily Dodwell (AT&T Labs Research), Jonah Gabry (Columbia University, Stan Development Team), Wes MckKinney and Dr. Neal Richardson (Ursa Labs), Dr. Thomas Mock (RStudio), Dr. David Robinson, (Heap), Dr. Max Kuhn (RStudio), Dr. Erin LeDell (H2O.ai), Monica Thieu (Columbia University), Camelia Hssaine (Codecademy), and myself and, coming soon, a bonus talk by Heather Nolis (T-Mobile) which will be shared on YouTube as soon as our team is done editing them, along with all the other talks.

Let’s take a look at some of the highlights from the conference:

Andrew Gelman Gave Another 40-Minute Talk (no slides, as always)

Our favorite quotes from Andrew Gelman’s talk, Truly Open Science: From Design and Data Collection to Analysis and Decision Making, which had no slides, as usual:

“Everyone training in statistics becomes a teacher.”

“The most important thing you should take away — put multiple graphs on a page.”

“Honesty and transparency are not enough.”

“Bad science doesn’t make someone a bad person.”

Laura Gabrysiak Shows us We Are Driven By Experience, and not Brand Loyalty…Hope you Folks had a Good Experience!

Laura’s talk on re-Inventing customer engagement with machine learning went through several interesting use cases from her time at Visa. In addition to being a data scientist, she is an active community organizer and the co-founder of R-Ladies Miami.

Adam Obeng Delivered a Talk on Adaptive Experimentation

One of my former students at Columbia University, Adam Obeng, gave a great presentation on his adaptive experimentation. We learned that adaptive experimentation is three things: The name of (1) a family of techniques, (2) Adam’s team at Facebook, and (3) an open source package produced by said team. He went through the applications which are hyper-parameter optimization for ML, experimentation with multiple continuous treatments, and physical experiments or manufacturing.

Dr. Jacqueline Nolis Invited Us to Crash Her Viral Website, Tweet Mashup

Jacqueline asked the crowd to crash her viral website,Tweet Mashup, and gave a great talk on her experience building it back in 2016. Her website that lets you combine the tweets of two different people. After spending a year making it in .NET, when she launched the site it became an immediate sensation. Years later, she was getting more and more frustrated maintaining the F# code and decided to see if I could recreate it in Shiny. Doing so would require having Shiny integrate with the Twitter API in ways that hadn’t been done by anyone before, and pushing the Twitter API beyond normal use cases.

Attendees Participated in Two Virtual Happy Hours Packed with Fun

At the Friday Happy Hour, we had a mathematical standup comedian for the first time in R Conference history. Comic and math major Rachel Lander (no relationship to me!) entertained us with awesome math and stats jokes.

Following the stand up, we had a Whiskey Master Class with our Vibe Sponsor Westland Distillery, and another one on Saturday with Bruichladdich Distillery (hard to pronounce and easy to drink). Attendees and speakers learned and drank together, whether it be their whiskey, matchas, soda or water.

All Proceeds from the A(R)T Auction went to the R Foundation Again

A newer tradition, the A(R)T Auction, took place again! We featured pieces by artists in the R Community, and all proceeds were donated to the R Foundation. The highest-selling piece at auction was Street Cred (2020) by Vivian Peng (Lander Analytics and Los Angeles Mayor’s Office, Innovation Team). The second highest was a piece by Jacqueline Nolis (Brightloom, and Build a Career in Data Science co-author), R Conference speaker, Designed by Allison Horst, artist in residence at RStudio.

The R-Ladies Group Photo Happened, Even Remotely!

As per tradition, we took an R-Ladies group photo, but, for the first time, remotely– as a screenshot! We would like to note that many more R-Ladies were present in the chat, but just chose not to share video.

Jon Harmon, Edna Mwenda, and Jessica Streeter win Raspberri Pis, Bluetooth Headphones, and Tenkeyless Keyboards for Most Active Tweeting During the Conference

This year’s Twitter Contest, in Malorie’s words, was a “ruthless but noble war.” You can see the NYR 2020 Dashboard here. A custom started that DCR 2018 by our Twitter scorekeeper Malorie Hughes (@data_all_day) has returned every year by popular demand, and now she’s stuck with it forever! Congratulations to our winners!

50+ Conference Attendees Participated in Pre-Conference Workshops Before

For the first time ever, workshops took place over the course of several days to promote work-life balance, and to give attendees the chance to take more than one course. We ran the following seven workshops:

Machine Learning in R with Max Kuhn
Machine learning in Python with scikit-learn with Andreas Mueller

Data Visualization with Shiny Workshop with Vivian Peng
Geospatial Statistics and Mapping in R with Kaz Sakamoto
Git for Data Science with Daniel Chen

Exploring Data with the Tidyverse with David Robinson
Web Scraping with Selenium with Amanda Dobbyn

Recreating the In-Person Experience

We recreated as much of the in-person experience as possible with attendee networking sessions, the speaker walk-on songs and fun facts, abundant prizes and giveaways, the Twitter contest, an art auction, and happy hours. In addition to all of this, we mailed conference programs, hex stickers, and other swag to each attendee (in the U.S.), along with discount codes from our Vibe Sponsors, MatchaBar, Westland Distillery and Bruichladdich Distillery.

Thank you, Lander Analytics Team!

Looking Forward to D.C. and Dublin
If you attended, we hope you had an incredible experience. If you did not, we hope to see you at the virtual DC R Conference in the fall, and at the first Dublin R Conference and the NYR next year!

The costs involved with a health insurance plan can be confusing so I perform an analysis of different options to find which plan is most cost effective

My wife and I recently brought a new R programmer into our family so we had to update our health insurance. Becky is a researcher in neuroscience and psychology at NYU so we decided to choose an NYU insurance plan.

Our son sporting a New York R Conference shirt.

For families there are two main plans: Value and Advantage. The primary differences between the plans are the following:

Item	Explanation	Value Plan Amount	Advantage Plan Amount
Bi-Weekly Premiums	The amount we pay every other week in order to have insurance	$160 ($4,160 annually)	$240 ($6,240 annually)
Deductible	Amount we pay directly to health providers before the insurance starts covering costs	$1,000	$800
Coinsurance	After the deductible is met, we pay this percentage of medical bills	20%	10%
Out-of-Pocket Maximum	This is the most we will have to pay to health providers in a year (premiums do not count toward this max)	$6,000	$5,000

We put them into a tibble for use later.

# use tribble() to make a quick and dirty tibble
parameters <- tibble::tribble(
    ~Plan, ~Premiums, ~Deductible, ~Coinsurance, ~OOP_Maximum,
    'Value', 160*26, 1000, 0.2, 6000,
    'Advantage', 240*26, 800, 0.1, 5000
)

Other than these cost differences, there is not any particular benefit of either plan over the other. That means whichever plan is cheaper is the best to choose.

This blog post walks through the steps of evaluating the plans to figure out which to select. Code is included so anyone can repeat, and improve on, the analysis for their given situation.

Cost

In order to figure out which plan to select we need to figure out the all-in cost, which is a function of how much we spend on healthcare in a year (we have to estimate our annual spending) and the aforementioned premiums, deductible, coinsurance and out-of-pocket maximum.

\[ \text{cost} = f(\text{spend}; \text{premiums}, \text{deductible}, \text{coinsurance}, \text{oop_maximum}) = \\ \text{min}(\text{oop_maximum}, \text{deductible} + \text{coinsurance}*(\text{spend}-\text{deductible}))+\text{premiums} \]

This can be written as an R function like this.

#' @title cost
#' @description Given healthcare spend and other parameters, calculate the actual cost to the user
#' @details Uses the formula above to caluclate total costs given a certain level of spending. This is the premiums plus either the out-of-pocket maximum, the actual spend level if the deductible has not been met, or the amount of the deductible plus the coinsurance for spend above the deductible but below the out-of-pocket maximum.
#' @author Jared P. Lander
#' @param spend A given amount of healthcare spending as a vector for multiple amounts
#' @param premiums The annual premiums for a given plan
#' @param deductible The deductible for a given plan
#' @param coinsurance The coinsurance percentage for spend beyond the deductible but below the out-of-pocket maximum
#' @param oop_maximum The maximum amount of money (not including premiums) that the insured will pay under a given plan
#' @return The total cost to the insured
#' @examples
#' cost(3000, 4160, 1000, .20, 6000)
#' cost(3000, 6240, 800, .10, 5000)
#'
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
    # spend is vectorized so we use pmin to get the min between oop_maximum and (deductible + coinsurance*(spend - deductible)) for each value of spend provided
    pmin(
      # we can never pay more than oop_maximum so that is one side
      oop_maximum, 
      # if we are under oop_maximum for a given amount of spend,
      # this is the cost
      pmin(spend, deductible) + coinsurance*pmax(spend - deductible, 0)
    ) + 
    # we add the premiums since that factors into our cost
    premiums
}

With this function we can see if one plan is always, or mostly, cheaper than the other plan and that’s the one we would choose.

R Packages

For the rest of the code we need these R packages.

library(dplyr)
library(ggplot2)
library(tidyr)
library(formattable)
library(readr)
library(readxl)

Spending

To see our out-of-pocket cost at varying levels of healthcare spend we build a grid in $1,000 increments from $1,000 to $70,000.

spending <- tibble::tibble(Spend=seq(1000, 70000, by=1000))

We call our cost function on each amount of spend for the Value and Advantage plans.

spending <- spending %>% 
    # use our function to calcuate the cost for the value plan
    mutate(Value=cost(
        spend=Spend, 
        premiums=parameters$Premiums[1], 
        deductible=parameters$Deductible[1], 
        coinsurance=parameters$Coinsurance[1], 
        oop_maximum=parameters$OOP_Maximum[1]
    )
    ) %>% 
    # use our function to calcuate the cost for the Advantage plan
    mutate(Advantage=cost(
        spend=Spend, 
        premiums=parameters$Premiums[2], 
        deductible=parameters$Deductible[2], 
        coinsurance=parameters$Coinsurance[2], 
        oop_maximum=parameters$OOP_Maximum[2]
    )
    ) %>% 
  # compute the difference in costs for each plan
  mutate(Difference=Advantage-Value) %>% 
  # the winner for a given amount of spend is the cheaper plan
  mutate(Winner=if_else(Advantage < Value, 'Advantage', 'Value'))

The results are in the following table, showing every other row to save space. The Spend column is a theoretical amount of spending with a red bar giving a visual sense for the increasing amounts. The Value and Advantage columns are the corresponding overall costs of the plans for the given amount of Spend. The Difference column is the result of Advantage – Value where positive numbers in blue mean that the Value plan is cheaper while negative numbers in red mean that the Advantage plan is cheaper. This is further indicated in the Winner column which has the corresponding colors.

Spend	Value	Advantage	Difference	Winner
$2,000	$5,360	$7,160	1800	Value
$4,000	$5,760	$7,360	1600	Value
$6,000	$6,160	$7,560	1400	Value
$8,000	$6,560	$7,760	1200	Value
$10,000	$6,960	$7,960	1000	Value
$12,000	$7,360	$8,160	800	Value
$14,000	$7,760	$8,360	600	Value
$16,000	$8,160	$8,560	400	Value
$18,000	$8,560	$8,760	200	Value
$20,000	$8,960	$8,960	0	Value
$22,000	$9,360	$9,160	-200	Advantage
$24,000	$9,760	$9,360	-400	Advantage
$26,000	$10,160	$9,560	-600	Advantage
$28,000	$10,160	$9,760	-400	Advantage
$30,000	$10,160	$9,960	-200	Advantage
$32,000	$10,160	$10,160	0	Value
$34,000	$10,160	$10,360	200	Value
$36,000	$10,160	$10,560	400	Value
$38,000	$10,160	$10,760	600	Value
$40,000	$10,160	$10,960	800	Value
$42,000	$10,160	$11,160	1000	Value
$44,000	$10,160	$11,240	1080	Value
$46,000	$10,160	$11,240	1080	Value
$48,000	$10,160	$11,240	1080	Value
$50,000	$10,160	$11,240	1080	Value
$52,000	$10,160	$11,240	1080	Value
$54,000	$10,160	$11,240	1080	Value
$56,000	$10,160	$11,240	1080	Value
$58,000	$10,160	$11,240	1080	Value
$60,000	$10,160	$11,240	1080	Value
$62,000	$10,160	$11,240	1080	Value
$64,000	$10,160	$11,240	1080	Value
$66,000	$10,160	$11,240	1080	Value
$68,000	$10,160	$11,240	1080	Value
$70,000	$10,160	$11,240	1080	Value

Of course, plotting often makes it easier to see what is happening.

spending %>% 
    select(Spend, Value, Advantage) %>% 
    # put the plot in longer format so ggplot can set the colors
    gather(key=Plan, value=Cost, -Spend) %>% 
    ggplot(aes(x=Spend, y=Cost, color=Plan)) + 
        geom_line(size=1) + 
        scale_x_continuous(labels=scales::dollar) + 
        scale_y_continuous(labels=scales::dollar) + 
        scale_color_brewer(type='qual', palette='Set1') + 
        labs(x='Healthcare Spending', y='Out-of-Pocket Costs') + 
        theme(
          legend.position='top',
          axis.title=element_text(face='bold')
        )

Plot of out-of-pocket costs as a function of actual healthcare spending. Lower is better.

It looks like there is only a small window where the Advantage plan is cheaper than the Value plan. This will be more obvious if we draw a plot of the difference in cost.

spending %>% 
    ggplot(aes(x=Spend, y=Difference, color=Winner, group=1)) + 
        geom_hline(yintercept=0, linetype=2, color='grey50') + 
        geom_line(size=1) + 
        scale_x_continuous(labels=scales::dollar) + 
        scale_y_continuous(labels=scales::dollar) + 
        labs(
          x='Healthcare Spending', 
          y='Difference in Out-of-Pocket Costs Between the Two Plans'
        ) + 
        scale_color_brewer(type='qual', palette='Set1') + 
        theme(
          legend.position='top',
          axis.title=element_text(face='bold')
        )

Plot of the difference in overall cost between the Value and Advantage plans. A difference greater than zero means the Value plan is cheaper and a difference below zero means the Advantage plan is cheaper.

To calculate the exact cutoff points where one plan becomes cheaper than the other plan we have to solve for where the two curves intersect. Due to the out-of-pocket maximums the curves are non-linear so we need to consider four cases.

The spending exceeds the point of maximum out-of-pocket spend for both plans
The spending does not exceed the point of maximum out-of-pocket spend for either plan
The spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan
The spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan

When the spending exceeds the point of maximum out-of-pocket spend for both plans the curves are parallel so there will be no cross over point.

When the spending does not exceed the point of maximum out-of-pocket spend for either plan we set the cost calculations (not including the out-of-pocket maximum) for each plan equal to each other and solve for the amount of spend that creates the equality.

To keep the equations smaller we use variables such as $d_v$ for the Value plan deductible, $c_a$ for the Advantage plan coinsurance and $oop_v$ for the out-of-pocket maximum for the Value plan.

\[ d_v + c_v(S – d_v) + p_v = d_a + c_a(S – d_a) + p_a \\ c_v(S – D_v) – c_a(S-d_a) = d_a – d_v + p_a – p_v \\ c_vS – c_vd_v – c_aS + c_ad_a = d_a – d_v + p_a – p_v \\ S(c_v – c_a) = d_a – c_ad_a – d_v + c_vd_v + p_a – p_v \\ S(c_v – c_a) = d_a(1 – c_a) – d_v(1 – c_v) + p_a – p_v \\ S = \frac{d_a(1 – c_a) – d_v(1 – c_v) + p_a – p_v}{(c_v – c_a)} \]

When the spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan, we set the out-of-pocket maximum plus premiums for the Value plan equal to the cost calculation of the Advantage plan.

\[ oop_v + p_v = d_a + c_a(S – d_a) + p_a \\ d_a + c_a(S – d_a) + p_a = oop_v + p_v \\ c_aS – c_ad_a = oop_v + p_v – p_a – d_a \\ c_aS = oop_v + p_v – p_a + c_ad_a – d_a \\ S = \frac{oop_v + p_v – p_a + c_ad_a – d_a}{c_a} \]

When the spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan, the solution is just the opposite of the previous equation.

\[ oop_a + p_a = d_v + c_v(S – d_v) + p_v \\ d_v + c_v(S – d_v) + p_v = oop_a + p_a \\ c_vS – c_vd_v = oop_a + p_a – p_v – d_v \\ c_vS = oop_a + p_a – p_v + c_vd_v – d_v \\ S = \frac{oop_a + p_a – p_v + c_vd_v – d_v}{c_v} \]

As an R function it looks like this.

#' @title calculate_crossover_points
#' @description Given healthcare parameters for two plans, calculate when one plan becomes more expensive than the other.
#' @details Calculates the potential crossover points for different scenarios and returns the ones that are true crossovers.
#' @author Jared P. Lander
#' @param premiums_1 The annual premiums for plan 1
#' @param deductible_1 The deductible plan 1
#' @param coinsurance_1 The coinsurance percentage for spend beyond the deductible for plan 1
#' @param oop_maximum_1 The maximum amount of money (not including premiums) that the insured will pay under plan 1
#' @param premiums_2 The annual premiums for plan 2
#' @param deductible_2 The deductible plan 2
#' @param coinsurance_2 The coinsurance percentage for spend beyond the deductible for plan 2
#' @param oop_maximum_2 The maximum amount of money (not including premiums) that the insured will pay under plan 2
#' @return The amount of spend at which point one plan becomes more expensive than the other
#' @examples
#' calculate_crossover_points(
#' 160, 1000, 0.2, 6000,
#' 240, 800, 0.1, 5000
#' )
#'
calculate_crossover_points <- function(
    premiums_1, deductible_1, coinsurance_1, oop_maximum_1,
    premiums_2, deductible_2, coinsurance_2, oop_maximum_2
)
{
    # calculate the crossover before either has maxed out
    neither_maxed_out <- (premiums_2 - premiums_1 + 
                              deductible_2*(1 - coinsurance_2) - 
                              deductible_1*(1 - coinsurance_1)) /
        (coinsurance_1 - coinsurance_2)
    
    # calculate the crossover when one plan has maxed out but the other has not
    one_maxed_out <- (oop_maximum_1 + 
                          premiums_1 - premiums_2 +
                          coinsurance_2*deductible_2 -
                          deductible_2) /
        coinsurance_2
    
    # calculate the crossover for the reverse
    other_maxed_out <- (oop_maximum_2 + 
                            premiums_2 - premiums_1 +
                            coinsurance_1*deductible_1 -
                            deductible_1) /
        coinsurance_1
    
    # these are all possible points where the curves cross
    all_roots <- c(neither_maxed_out, one_maxed_out, other_maxed_out)
    
    # now calculate the difference between the two plans to ensure that these are true crossover points
    all_differences <- cost(all_roots, premiums_1, deductible_1, coinsurance_1, oop_maximum_1) - 
        cost(all_roots, premiums_2, deductible_2, coinsurance_2, oop_maximum_2)
    
    # only when the difference between plans is 0 are the curves truly crossing
    all_roots[all_differences == 0]
}

We then call the function with the parameters for both plans we are considering.

crossovers <- calculate_crossover_points(
    parameters$Premiums[1], parameters$Deductible[1], parameters$Coinsurance[1], parameters$OOP_Maximum[1],
    parameters$Premiums[2], parameters$Deductible[2], parameters$Coinsurance[2], parameters$OOP_Maximum[2]
)

crossovers

## [1] 20000 32000

We see that the Advantage plan is only cheaper than the Value plan when spending between $20,000 and $32,000.

The next question is will our healthcare spending fall in that narrow band between $20,000 and $32,000 where the Advantage plan is the cheaper option?

Probability of Spending

This part gets tricky. I’d like to figure out the probability of spending between $20,000 and $32,000. Unfortunately, it is not easy to find healthcare spending data due to the opaque healthcare system. So I am going to make a number of assumptions. This will likely violate a few principles, but it is better than nothing.

Assumptions and calculations:

Healthcare spending follows a log-normal distribution
We will work with New York State data which is possibly different than New York City data
We know the mean for New York spending in 2014
We will use the accompanying annual growth rate to estimate mean spending in 2019
We have the national standard deviation for spending in 2009
In order to figure out the standard deviation for New York, we calculate how different the New York mean is from the national mean as a multiple, then multiply the national standard deviation by that number to approximate the New York standard deviation in 2009
We use the growth rate from before to estimate the New York standard deviation in 2019

First, we calculate the mean. The Centers for Medicare & Medicaid Services has data on total and per capita medical expenditures by state from 1991 to 2014 and includes the average annual percentage growth. Since the data are bundled in a zip with other files, I posted them on my site for easy access.

spend_data_url <- 'https://jaredlander.com/data/healthcare_spending_per_capita_1991_2014.csv'
health_spend <- read_csv(spend_data_url)

We then take just New York spending for 2014 and multiply it by the corresponding growth rate.

ny_spend <- health_spend %>% 
  # get just New York
  filter(State_Name == 'New York') %>% 
  # this row holds overall spending information
  filter(Item == 'Personal Health Care ($)') %>% 
  # we only need a few columns
  select(Y2014, Growth=Average_Annual_Percent_Growth) %>% 
  # we have to calculate the spending for 2019 by accounting for growth
  # after converting it to a percentage
  mutate(Y2019=Y2014*(1 + (Growth/100))^5)

ny_spend

Y2014	Growth	Y2019
9778	5	12479.48

The standard deviation is trickier. The best I can find was the standard deviation on the national level in 2009. In 2013 the Centers for Medicare & Medicaid Services wrote in Volume 3, Number 4 of Medicare & Medicaid Research Review an article titled Modeling Per Capita State Health Expenditure Variation: State-Level Characteristics Matter. Exhibit 2 shows that the standard deviation of healthcare spending was $1,241 for the entire country in 2009. We need to estimate the New York standard deviation from this and then account for growth into 2019.

Next, we figure out the difference between the New York State spending mean and the national mean as a multiple.

nation_spend <- health_spend %>% 
  filter(Item == 'Personal Health Care ($)') %>% 
  filter(Region_Name == 'United States') %>% 
  pull(Y2009)

ny_multiple <- ny_spend$Y2014/nation_spend

ny_multiple

## [1] 1.418746

We see that the New York average is 1.4187464 times the national average. So we multiply the national standard deviation from 2009 by this amount to estimate the New York State standard deviation and assume the same annual growth rate as the mean. Recall, we can multiply the standard deviation by a constant.

\[ \begin{align} \text{var}(x*c) &= c^2*\text{var}(x) \\ \text{sd}(x*c) &= c*\text{sd}(x) \end{align} \]

ny_spend <- ny_spend %>% 
  mutate(SD2019=1241*ny_multiple*(1 + (Growth/100))^10)

ny_spend

Y2014	Growth	Y2019	SD2019
9778	5	12479.48	2867.937

My original assumption was that spending would follow a normal distribution, but New York’s resident agricultural economist, JD Long, suggested that the spending distribution would have a floor at zero (a person cannot spend a negative amount) and a long right tail (there will be many people with lower levels of spending and a few people with very high levels of spending), so a log-normal distribution seems more appropriate.

\[ \text{spending} \sim \text{lognormal}(\text{log}(12479), \text{log}(2868)^2) \]

Visualized it looks like this.

draws <- tibble(
  Value=rlnorm(
    n=1200, 
    meanlog=log(ny_spend$Y2019), 
    sdlog=log(ny_spend$SD2019)
  )
)

ggplot(draws, aes(x=Value)) + geom_density() + xlim(0, 75000)

The log-normal distribution has a long right tail.

We can see that there is a very long right tail which means there are many low values and few high values.

Then the probability of spending between $20,000 and $32,000 can be calculated with plnorm().

plnorm(crossovers[2], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)) - 
  plnorm(crossovers[1], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019))

## [1] 0.02345586

So we only have a 2.35% probability of our spending falling in that band where the Advantage plan is more cost effective. Meaning we have a 97.65% probability that the Value plan will cost less over the course of a year.

We can also calculate the expected cost under each plan. We do this by first calculating the probability of spending each (thousand) dollar amount (since the log-normal is a continuous distribution this is an estimated probability). We multiply each of those probabilities against their corresponding dollar amounts. Since the distribution is log-normal we need to exponentiate the resulting number. The data are on the thousands scale, so we multiply by 1000 to put it back on the dollar scale. Mathematically it looks like this.

\[ \mathbb{E}_{\text{Value}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Value}} \right\} \\ \mathbb{E}_{\text{Advantage}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Advantage}} \right\} \]

The following code calculates the expected cost for each plan.

spending %>% 
  # calculate the point-wise estimated probabilities of the healthcare spending
  # based on a log-normal distribution with the appropriate mean and standard deviation
  mutate(
    SpendProbability=dlnorm(
      Spend, 
      meanlog=log(ny_spend$Y2019), 
      sdlog=log(ny_spend$SD2019)
    )
  ) %>% 
  # compute the expected cost for each plan
  # and the difference between them
  summarize(
    ValueExpectedCost=sum(Value*SpendProbability),
    AdvantageExpectedCost=sum(Advantage*SpendProbability),
    ExpectedDifference=sum(Difference*SpendProbability)
  ) %>% 
  # exponentiate the numbers so they are on the original scale
  mutate_each(funs=exp) %>% 
  # the spending data is in increments of 1000
  # so multiply by 1000 to get them on the dollar scale
  mutate_each(funs=~ .x * 1000)

ValueExpectedCost	AdvantageExpectedCost	ExpectedDifference
5422.768	7179.485	1323.952

This shows that overall the Value plan is cheaper by about $1,324 dollars on average.

Conclusion

We see that there is a very small window of healthcare spending where the Advantage plan would be cheaper, and at most it would be about $600 cheaper than the Value plan. Further, the probability of falling in that small window of savings is just 2.35%.

So unless our spending will be between $20,000 and $32,000, which it likely will not be, it is a better idea to choose the Value plan.

Since the Value plan is so likely to be cheaper than the Advantage plan I wondered who would pick the Advantage plan. Economist Jon Hersh invokes behavioral economics to explain why people may select the Advantage plan. Some parts of the Advantage plan are lower than the Value plan, such as the deductible, coinsurance and out-of-pocket maximum. People see that under certain circumstances the Advantage plan would save them money and are enticed by that, not realizing how unlikely that would be. So they are hedging against a low probability situation. (A consideration I have not accounted for is family size. The number of members in a family can have a big impact on the overall spend and whether or not it falls into the narrow band where the Advantage plan is cheaper.)

In the end, the Value plan is very likely going to be cheaper than the Advantage plan.

Try it at Home

I created a Shiny app to allow users to plug in the numbers for their own plans. It is rudimentary, but it gives a sense for the relative costs of different plans.

Thanks

A big thanks to Jon Hersh, JD Long, Kaz Sakamoto, Rebecca Martin and Adam Hogan for reviewing this post.

Data scientists and R enthusiasts gathered for the 5th annual New York R Conference held on May 9th-11th. In front of a crowd of more than 300 attendees, 24 speakers gave presentations on topics ranging from deep learning and building packages in R to football and hockey analytics.

Speakers included: Andrew Gelman, Emily Robinson, Max Kuhn, Dan Chen, Jared Lander, Namita Nandakumar, Mike Band, Soumya Kalra, Brooke Watson, David Madigan, Jacqueline Nolis, Heather Nolis, Gabriela Hempfling, Wes McKinney, Noam Ross, Jim Savage, Ludmila Janda, Emily Dodwell, Michelle Gill, Krista Watts, Elizabeth Sweeney, Adam Chekroud, Amanda Dobbyn and Letisha Smith.

This year marked the ten-year anniversary of the New York Open Statistical Programming Meetup. It has been incredible to see the growth of meetup over the years. We now have over 10,000 members around the world!

Let’s take a look at some of the highlights from the conference:

Jonah Gabry Kicked Off “R” Week at the New York Open Statistical Programming Meetup with a Talk on Using Stan in R

Jonah Gabry kicking off R Week with a talk about the Stan ecosystem

Jonah Gabry from the Stan Development Team kicked off “R” week with a talk on making Bayes easier in the R ecosystem. Jonah went over the packages (rstanarm, rstantools, bayesplot and loo) which emulate other R model-fitting functions, unify function naming across Stan-based R packages, and develop plotting functions using ggplot objects.

50 Conference Attendees Participated in Pre-Conference Workshops on Thursday before the Conference

A group of people in a room

Description automatically generated — People learning about machine learning from Max Kuhn during the pre-conference workshops

On the Thursday before the two-day conference, more than 50 conference attendees arrived at Work-Bench a day early for a full day of workshops. This was the first year of the R Conference Workshop Series. Max Kuhn, Dan Chen, Elizabeth Sweeney and Kaz Sakamoto each led a workshop which covered the following topics:

Machine Learning with Caret (Max Kuhn)
Git for Data Science (Dan Chen)
Introduction to Survival Analysis (Elizabeth Sweeney)
Geospatial Statistics and Mapping in R (Kaz Sakamoto)

The Growth of R-Ladies Summed Up in Three Pictures…

A group of people posing for a photo

Description automatically generated

We are so excited to see the growth of the R-Ladies community and we appreciate their support for the NY R Conference over the years. Congratulations ladies!

Dr. Andrew Gelman Delivers Keynote Speech on the Fallacy of P-Values and Thinking like a Statistician—All Without Slides

A person standing in front of a curtain

Description automatically generated — Andrew Gelman wowing the crowd as usual

There wasn’t a soul in the crowd who wasn’t hanging on every word from Columbia professor Dr. Andrew Gelman. The only speaker with a 40-minute time slot, and the only speaker to not use slides, Dr. Gelman talked about life as a statistician, warned of the perils of p-values and stressed the importance of simulation—before data collection—to improve our understanding of possible real-life scenarios. “Only through simulating fake data, can you really have statistical confidence about whatever performance metric you’re aiming for,” Gelman noted.

While we try not to pick a favorite speaker, Dr. Gelman runs away with that title every time he comes to speak at the New York R Conference.

Jacqueline and Heather Nolis Taught Us to Not to Be Afraid of Deep Learning and Model Deployment in Production

A group of people standing in a room

Description automatically generated — Jacqueline Nolis showing how fun and easy deep learning can be

The final talk on day one was perhaps the most entertaining and insightful from the weekend. Jacqueline Nolis taught us how developing a deep learning model is easier than we thought and how humor can help us understand a complex idea in a simple form. Our top five favorite neural network-generate pet names: Dia, Spok, Jori, Lule, and Timuse!

On Saturday morning, Heather Nolis showed us how we can deploy the model into production. Heather walked through the steps involved in preparing an R model for production using containers (Docker) and container orchestration (Kubernetes) to share models throughout an organization or for the public. How can we put a model into production without your laptop running 24/7? By running the code safely on a server in the cloud!

Emily Robinson and Honey Berk Win Headphones for Most Tweets During the Conference

Two people standing in front of a curtain

Description automatically generated — Emily Robinson winning the tweeting grand prize

If you’re not following Emily Robinson (@robinson_es) and Honey Berk (@honeyberk), you’re missing out! Emily and Honey led all conference attendees in Twitter mentions according to our Twitter scorekeeper Malorie Hughes (@data_all_day). Because of Emily and Honey’s presence on Twitter, those who were unable to attend the conference were able to follow along with all of our incredible speakers throughout the two-day event.

https://pbs.twimg.com/media/D6TF7CdW0AAmqOM.jpg:large — Tweet stats dashboard created by Malorie Hughes

Jared Lander Debuts New-Born R Package Hex Sticker T-Shirts: Congratulations to Jared and Rebecca on the Birth of their Son, Lev

Baby themed hex sticker shirts designed by Vivian Peng

During my talk I debuted a custom R package hex sticker t-shirt with my wife Rebecca and son Lev. We R a very nerdy family.

Looking Forward to 2020

If you attended the 2019 New York R Conference, we hope you had an incredible experience. If you did not attend the conference, we hope to see you next year!

Ten years ago Josh Reich held the first ever New York Open Statistical Programming Meetup at Union Square Ventures. Back then it was called the New York R Meetup and 21 people RSVP’d. I discovered the Meetup three months later thanks to Andrew Gelman’s blog and by then the RSVP count had doubled and Drew Conway became a co-organizer. The experience was so much fun and I learned a lot of good stuff. I remember learning about the head() and tail() functions, wishing I learned them in grad school.

Early Meetup — The crowd at an early meetup fit in a small room at Columbia with room to spare.

I started attending regularly and pretty soon Drew decided to serve pizza which later led to years of pizza data. He also designed a logo for the NYC Data Mafia, which made for a great t-shirt that we still sell. One time, a number of us were talking and realized we were all answering each other’s questions on StackOverflow. Our community was growing both in person and online. I fell in love with the group because it was a great place to learn and hang out with smart, welcoming people.

During the first two years our hosts included NYU, Columbia, AOL and a handful of others. At this time there were about 1,800 members with Drew as the sole organizer who was ready to focus on other parts of his life, so he asked Wes McKinney and me to take over as organizers. This was after Drew renamed the group the Open Statistical Programming Meetup as to include other open source languages like Python, Julia, Go and SQL. I was incredibly thrilled to organize this group which meant so much to me.

Over the next eight years our numbers swelled to almost 11,000 with members and speakers coming from all over the world. We have held the Meetup at places such as eBay, AT&T, iHeartRadio, Work-Bench, Knewton, Twitter, New York Presbyterian, Rise New York and Google. The most popular event welcomed nearly 400 people when Hadley Wickham spoke in September of 2015, the only time the Meetup met on a Friday.

Hadley, Becky and Jared — The Meetup’s most popular night featured Hadley Wickham and coincided with my fifth date with Rebecca Martin, my future wife.

That night was also my fifth date with Rebecca Martin. We originally met during Michael Kane’s talk about PubMed then reconnected about a year later. We went on to get married and have a kid together. The New York Times used the nerdiest closing line ever for our wedding announcement: “The couple met in New York in May 2014 at a meet-up about statistical programming organized by the groom.”

The Meetup has grown not only in numbers but in reach as well. There’s a website hosting all of the presentations, we livestream the Meetups and people from all over the world chat in our Slack team. Our live events include an ongoing workshop series and conferences in New York and Washington DC, which just hit their fifth anniversary, all for building and supporting the community and open source software.

NY R Conference — The crowd at the New York R Conference.

These past ten years have been a collection of amazing experiences for me where I got to learn from some of the world’s best experts and develop lasting relationships with great people. This community means so much to me and I very much look forward to its continued growth over the next decade.

After four sold-out years in New York City, the R Conference made its debut in Washington DC to a sold-out crowd of data scientists at the Ronald Reagan Building on November 8th & 9th. Our speakers shared presentations on a variety of R-related topics.

A big thank you to our speakers Max Kuhn, Emily Robinson, Mike Powell, Mara Averick, Max Richman, Stephanie Hicks, Michael Garris, Kelly O’Briant, David Smith, Anna Kirchner, Roger Peng, Marck Vaisman, Soumya Kalra, Jonathan Hersh, Vivian Peng, Dan Chen, Catherine Zhou, Jim Klucar, Lizzy Huang, Refael Lav, Ami Gates, Abhijit Dasgupta, Angela Li and Tommy Jones.

Some highlights from the conference:

R Superstars Mara Averick, Roger Peng and Emily Robinson

A hallmark of our R conferences is that the speakers hang out with all the attendees and these three were crowd favorites.

Michael Powell Brings R to the aRmy

Major Michael Powell describes how R has brought efficiency to the Army Intelligence and Security Command by getting analysts out of Excel and into the Tidyverse. “Let me turn those 8 hours into 8 seconds for you,” says Powell.

Max Kuhn Explains the Applications of Equivocals to Apply Levels of Certainty to Predictions

After autographing his book, Applied Predictive Modeling, for a lucky attendee, Max Kuhn explains how Equivocals can be applied to individual predictions in order to avoid reporting predictions when there is significant uncertainty.

NYR and DCR Speaker Emily Robinson Getting an NYR Hoodie for her Awesome Tweeting

Emily Robinson tweeted the most at the 2018 NYR conference, winning her a WASD mechanical keyboard and at DCR she came in second so we gave her a limited edition NYR hoodie.

Max Richman Shows How SQL and R can Co-Exist

Max Richman, wearing the same shirt he wore when he spoke at the first NYR, shows parallels between dplyr and SQL.

Michael Garris Tells the Story of the MNIST Dataset

Michael Garris was a member of the team that built the original MNIST dataset, which set the standard for handwriting image classification in the early 1990s. This talk may have been the first time the origin story was ever told.

R Stats Luminary Roger Peng Explains Relationship Between Air Pollution and Public Health

Roger Peng shows us how air pollution levels has fallen over the past 50 years resulting in dramatic improvements in air quality and health (with help from R).

Kelly O’Briant Combining R with Serverless Computing

Kelly O’Briant demonstrates how to easily deploy R projects on Google Compute Engine and promoted the new #radmins hashtag.

Hot Dog vs Not Hot Dog by David Smith (Inspired by Jian-Yang from HBO’s Silicon Valley)

David Smith, one of the original R users, shows how to recreate HBO’s Silicon Valley’s Not Hot Dog app using R and Azure

Jon Hersh Describes How to Push for Data Science Within Your Organization

Jon Hersh discusses the challenges, and solutions, of getting organizations to embrace data science.

Vivian Peng and the Importance of Data Storytelling

Vivian Peng asks the question, how do we protect the integrity of our data analysis when it’s published for the world to see?

Dan Chen Signs His Book for David Smith

Dan Chen autographing a copy of his book, Pandas for Everyone, for David Smith. Now David Smith has to sign his book, An Introduction to R, for Dan.

Malorie Hughes Analyzing Tweets

On the first day I challenged the audience to analyze the tweets from the conference and Malorie Hughes, a data scientist with NPR, designed a Twitter analytics dashboard to track the attendee with the most tweets with the hashtag #rstatsdc. Seth Wenchel won a WASD keyboard for the best tweeting. And we presented Malorie wit a DCR speaker mug.

Strong Showing from the #RLadies!

The #rladies group is growing year after year and it is great seeing them in force at NYR and DCR!

Packages

Matthew Hendrickson, a DCR attendee, posted on twitter every package mentioned during the two-day conference: tidyverse, tidycensus, leaflet, leaflet.extras, funneljoin, glmnet, xgboost, rstan, rstanarm, LowRankQP, dplyr, coefplot, bayesplot, keras, tensorflow, lars, magrittr, purrr, rsample, useful, knitr, rmarkdown, ggplot2, ggiraph, ggrepel, ggraph, ggthemes, gganimate, ggmap, plotROC, ggridges, gtrendsr, tlnise, tm, Bioconductor, plyranges, sf, tmap, textmineR, tidytext, gmailr, rtweet, shiny, httr, parsnip, probably, plumber, reprex, crosstalk, arules and arulesviz.

Data Community DC

A special thanks to the Data Community DC for helping us make the DC R Conference an incredible experience.

Videos

The videos for the conference will be posted in the coming weeks to dc.rstats.ai.

See You Next Year

Looking forward to more great conferences at next year’s NYR and DCR!

The 2018 New York R Conference was the biggest and best yet. This is both in terms of the crowd size and content. The speakers included some of the R community’s best such as Hadley Wickham, David Robinson, Jennifer Hill, Max Kuhn, Andreas Mueller (ok, a little Python), Evelina Gabasova, Sean Taylor and Jeff Ryan. I am proud to say we were almost at gender parity for both attendees and speakers which is amazing for a tech conference. Brooke Watson even excitedly noted that we had a line for the women’s room.

Excited to announce that i am waiting in a LINE for the WOMEN’s restroom at a tech conference! Thanks @RLadiesNYC, @RLadiesGlobal, and @nyhackr for the opportunity, wouldn’t be here without your support

— Brooke Watson (@brookLYNevery1) April 21, 2018

Particularly gratifying for me was seeing so many of my students speak. Eurry Kim, Dan Chen and Alex Boghosian all gave excellent talks.

Some highlights that stuck out to me are:

Emily Robinson Shows There is More to the Tidyverse than Hadley

Emily Robinson, otherwise known as ERob, gave an excellent talk showing how the Tidyverse is so much more than just Hadley and that there are many people inspired by him to contribute in the Tidy way.

Sean Taylor Forecasted the Future with Prophet

Sean Taylor, former New Yorker and unrepentant Eagles fan, demonstrate his powerful R and Python, package Prophet, for forecasting time series data. Facebook open sourced his work so we could all benefit.

OG Data Mafia Founder Drew Conway Popped In

A lucky fan got an autographed NYC Data Mafia t-shirt from Drew Conway.

David Smith Playing Minecraft Through R

David Smith played Minecraft through R, including building objects and moving through the world.

Evelina Gabasova Used Social Network Analysis to Break Down Star Wars

Evelina Gabasova wowed the audience with her fun talk and detailed analysis of character interaction in Star Wars.

Dusty Turner Represented West Point

Dusty Turner taught us how the United States Military Academy uses R for both student instruction and evaluation.

Hadley Wickham Delved into the Nitty Gritty of R

Hadley Wickham showed us how to get into the internals of R and figure out how to examine objects from a memory perspective.

Jennifer Hill Demonstrated Awesome Machine Learning Techniques for Causal Inference

Following her sold-out meetup appearance in March, Jennifer continued to push the boundaries of causal inference.

I Made the Authors of Caret and scitkit-learn Show That R and Python Can Get Along

While both Andreas and Max gave great individual talks, I made them pose for this peace-making photo.

David Robinson Got the Upper Hand in a Sibling Twitter Duel

Given only about 30 minutes notice, David put together an entire slideshow on how to livetweet and how to compete with your sibling.

In the End Emily Robinson Beat Her Brother For Best Tweeting

Despite David’s headstart Emily was the best tweeter (as calculated by Max Kuhn and Mara Averick) so she won the WASD Code mechanical keyboard with MX Cherry Clear switches.

Silent Auction of Data Paintings

Thomas Levine made paintings of famous datasets that we auctioned off with the proceeds supporting the R Foundation and the Free Software Foundation. The Robinson family very graciously chipped in and bought the painting of the Pizza Poll data for me! I’m still floored by this and in love with the painting.

Ice Cream Sandwiches

In addition to bagels and eggs sandwiches from Murray’s Bagels, Israeli food from Hummus and Pita Company, avocado toast and coffee from Bluestone Lane Coffee and pizza from Fiore’s, we also had ice creams sandwiches from World’s Best Cookie Dough.

All the Material

To catch up on all the presentations check out Mara Averick’s excellent notes:

Rest of #rstatsnyc Day 1 (before my little ❤ couldn’t take it) feat. @jaredlander @chendaniely @seanjtaylor @noamross @brookLYNevery1 @dtdusty @DogmaticPrior @wesmckinn #rstats 🗽 pic.twitter.com/pTOXTPTO5b

— Mara Averick (@dataandme) April 23, 2018

Or check out all of Brooke’s drawings, collated by Dan Chen.

hey #rstatsnyc! I put together all of @brookLYNevery1‘s lovely illustrations and tweets for each speaker into a simple #tldr blog post: https://t.co/LVk6963P9W

There’s a pic in this tweet b/c @drob said it was good to have pictures :p#rstats #nycdatamafia #python pic.twitter.com/zU1DfhFlKk

— Dⓐniel Chen (@chendaniely) April 23, 2018

Videos and Upcoming Events

The videos will be posted at rstats.nyc in a few weeks for all to enjoy.

There are a number of other events coming up including:

May 9: Tidy by Nature: Down with OPP (other people’s pipes)
May 17-18: Caret Master Class with Max Kuhn
June 21: While not officially announced yet, the June meetup features Rob Hyndman
June 25-27: Forecasting Master Class with Rob Hyndman
August 6-8: Stan and Bayes Master Class with Jonah Gabry

We are already beginning plans for next year’s conference and are working on bringing it to DC as well! Stay tuned for all that and more.

In my last post I discussed using coefplot on glmnet models and in particular discussed a brand new function, coefpath, that uses dygraphs to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of coefplot is the ability to show coefficient plots from xgboost models. Beyond fitting boosted trees and boosted forests, xgboost can also fit a boosted Elastic Net. This makes it a nice alternative to glmnet even though it might not have some of the same user niceties.

To illustrate, we use the same data as our previous post.

First, we load the packages we need and note the version numbers.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('caret', 'coefplot', 'DT', 'xgboost')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'dygraphs', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')

versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)

Package	Version
caret	6.0.78
coefplot	1.2.6
dplyr	0.7.4
DT	0.2
dygraphs	1.1.1.4
knitr	1.18
magrittr	1.5
purrr	0.2.4
tibble	1.4.2
useful	1.2.3
xgboost	0.6.4

Then, we read the data. The data are available at https://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world. We also get validation data which is helpful when fitting xgboost mdoels.

manTrain <- readRDS(url('https://www.jaredlander.com/data/manhattan_Train.rds'))
manVal <- readRDS(url('https://www.jaredlander.com/data/manhattan_Validate.rds'))

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the data to display the actual data.

datatable(manTrain %>% dplyr::sample_n(size=1000), elementId='TrainingSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE
              ))

<br />

While glmnet automatically standardizes the input data, xgboost does not, so we calculate that manually. We use preprocess from caret to compute the mean and standard deviation of each numeric column then use these later.

preProc <- preProcess(manTrain, method=c('center', 'scale'))

Just like with glmnet, we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with xgboost we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark

manX <- useful::build.x(valueFormula, data=predict(preProc, manTrain),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

manX_val <- useful::build.x(valueFormula, data=predict(preProc, manVal),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY_val <- useful::build.y(valueFormula, data=manVal)

There are two functions we can use to fit xgboost models, the eponymous xgboost and xgb.train. When using xgb.train we first store our X and Y matrices in a special xgb.DMatrix object. This is not a necessary step, but makes things a bit cleaner.

manXG <- xgb.DMatrix(data=manX, label=manY)
manXG_val <- xgb.DMatrix(data=manX_val, label=manY_val)

We are now ready to fit a model. All we need to do to fit a linear model instead of a tree is set booster='gblinear' and objective='reg:linear'.

mod1 <- xgb.train(
    # the X and Y training data
    data=manXG,
    # use a linear model
    booster='gblinear',
    # minimize the a regression criterion 
    objective='reg:linear',
    # use MAE as a measure of quality
    eval_metric=c('mae'),
    # boost for up to 500 rounds
    nrounds=500,
    # print out the eval_metric for both the train and validation data
    watchlist=list(train=manXG, validate=manXG_val),
    # print eval_metric every 10 rounds
    print_every_n=10,
    # if the validate eval_metric hasn't improved by this many rounds, stop early
    early_stopping_rounds=25,
    # penalty terms for the L2 portion of the Elastic Net
    lambda=10, lambda_bias=10,
    # penalty term for the L1 portion of the Elastic Net
    alpha=900000000,
    # randomly sample rows
    subsample=0.8,
    # randomly sample columns
    col_subsample=0.7,
    # set the learning rate for gradient descent
    eta=0.1
)

## [1]  train-mae:1190145.875000    validate-mae:1433464.750000 
## Multiple eval metrics are present. Will use validate_mae for early stopping.
## Will train until validate_mae hasn't improved in 25 rounds.
## 
## [11] train-mae:938069.937500 validate-mae:1257632.000000 
## [21] train-mae:932016.625000 validate-mae:1113554.625000 
## [31] train-mae:931483.500000 validate-mae:1062618.250000 
## [41] train-mae:931146.750000 validate-mae:1054833.625000 
## [51] train-mae:930707.312500 validate-mae:1062881.375000 
## [61] train-mae:930137.375000 validate-mae:1077038.875000 
## Stopping. Best iteration:
## [41] train-mae:931146.750000 validate-mae:1054833.625000

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using dygraphs.

dygraphs::dygraph(mod1$evaluation_log)

<br />

We can now plot the coefficients using coefplot. Since xgboost does not save column names, we specify it with feature_names=colnames(manX). Unlike with glmnet models, there is only one penalty so we do not need to specify a specific penalty to plot.

coefplot(mod1, feature_names=colnames(manX), sort='magnitude')

This is another nice addition to coefplot utilizing the power of xgboost.

Related Posts

Getting the Data

Preparing the Data

Skip All of This

Visualizing with Interactive Plots

Computing Cross Correlations

Fitting Time Series Models with {fable}

Running Everything with {targets}

Reporting with RStudio Connect

What Comes Next?

Related Posts

Looking Forward to New York, R|Gov, and Dublin!

Related Posts

Related Posts

The costs involved with a health insurance plan can be confusing so I perform an analysis of different options to find which plan is most cost effective

Cost

R Packages

Spending

Probability of Spending

Conclusion

Try it at Home

Thanks

Related Posts

Jonah Gabry Kicked Off “R” Week at the New York Open Statistical Programming Meetup with a Talk on Using Stan in R

50 Conference Attendees Participated in Pre-Conference Workshops on Thursday before the Conference

The Growth of R-Ladies Summed Up in Three Pictures…

Dr. Andrew Gelman Delivers Keynote Speech on the Fallacy of P-Values and Thinking like a Statistician—All Without Slides

Jacqueline and Heather Nolis Taught Us to Not to Be Afraid of Deep Learning and Model Deployment in Production

Emily Robinson and Honey Berk Win Headphones for Most Tweets During the Conference

Jared Lander Debuts New-Born R Package Hex Sticker T-Shirts: Congratulations to Jared and Rebecca on the Birth of their Son, Lev

Looking Forward to 2020

Related Posts

Related Posts

Some highlights from the conference:

R Superstars Mara Averick, Roger Peng and Emily Robinson

Michael Powell Brings R to the aRmy

Max Kuhn Explains the Applications of Equivocals to Apply Levels of Certainty to Predictions

NYR and DCR Speaker Emily Robinson Getting an NYR Hoodie for her Awesome Tweeting

Max Richman Shows How SQL and R can Co-Exist

Michael Garris Tells the Story of the MNIST Dataset

R Stats Luminary Roger Peng Explains Relationship Between Air Pollution and Public Health

Kelly O’Briant Combining R with Serverless Computing

Hot Dog vs Not Hot Dog by David Smith (Inspired by Jian-Yang from HBO’s Silicon Valley)

Jon Hersh Describes How to Push for Data Science Within Your Organization

Vivian Peng and the Importance of Data Storytelling

Dan Chen Signs His Book for David Smith

Malorie Hughes Analyzing Tweets

Strong Showing from the #RLadies!

Packages

Data Community DC

Videos

See You Next Year

Related Posts

Emily Robinson Shows There is More to the Tidyverse than Hadley

Sean Taylor Forecasted the Future with Prophet

OG Data Mafia Founder Drew Conway Popped In

David Smith Playing Minecraft Through R

Evelina Gabasova Used Social Network Analysis to Break Down Star Wars

Dusty Turner Represented West Point

Hadley Wickham Delved into the Nitty Gritty of R

Jennifer Hill Demonstrated Awesome Machine Learning Techniques for Causal Inference

I Made the Authors of Caret and scitkit-learn Show That R and Python Can Get Along

David Robinson Got the Upper Hand in a Sibling Twitter Duel

In the End Emily Robinson Beat Her Brother For Best Tweeting

Silent Auction of Data Paintings

Ice Cream Sandwiches

All the Material

Videos and Upcoming Events

Related Posts

Related Posts

Fitting Time Series Models with `{fable}`

Running Everything with `{targets}`