In previous posts I talked about collecting temperature data from different rooms in my house and automating that process using Docker and GitHub Actions. Now it is time to analyze that data to figure out what is happening in the house to make informed decisions about replacing the HVAC units and how I can make the house more comfortable for everyone until then.

During this process I pull the data into R from a DigitalOcean space (like an AWS S3 bucket) using {arrow}, manipulate the data with {dplyr}, {tidyr} and {tsibble}, visualize the data with {timetk}, fit models with {fable} and visualize the models with {coefplot}.

# Getting the Data

As discussed in the post about tracking the data, we created a CSV for each day of data in a DigitalOcean space. The CSV tracks room temperatures, thermostat settings, outdoor temperature, etc for every five-minute interval during each day.

In the past, if multiple files had to be read into a single data.frame or tibble the best course of action would have been to combine map_df() from {purrr} with read_csv() or fread() from {readr} and {data.table}, respectively. But as seen in Wes McKinney and Neal Richardson’s talk from the 2020 New York R Conference, using open_data() from {arrow} then collecting the data with dplyr::collect() should be faster. An added bonus is that {arrow} has functionality built in to pull from an S3 bucket, and that includes DigitalOcean spaces.

The first step is to use arrow::S3FileSystem$create() to make a reference to the S3 file system. Several pieces of information are needed for this: • access_key: The DigitalOcean Access Key created at https://cloud.digitalocean.com/account/api/tokens. • secret_key: The DigitalOcean Secret Key created at https://cloud.digitalocean.com/account/api/tokens. • scheme: This should just be "https". • endpoint_override: Since we are using DigitalOcean and not AWS we need to change the default to something along the lines of "nyc3.digitaloceanspaces.com", depending on the region. The Access Key can be retrieved at any time, but the Secret Key is only displayed one time, so we must save it somewhere. This is all the same information used when writing the files to the DigitalOcean space in the first place. We also save it all in environment variables in the .Renviron file to avoid exposing this sensitive information in our code. It is very important to not check this file into git to reduce risk of exposing the private keys. space <- arrow::S3FileSystem$create(
access_key=Sys.getenv('AWS_ACCESS_KEY_ID'),
secret_key=Sys.getenv('AWS_SECRET_ACCESS_KEY'),
scheme="https",
endpoint_override=glue::glue("{Sys.getenv('DO_REGION')}.{Sys.getenv('DO_BASE')}")
)
space
## S3FileSystem

The space object represents the S3 file system we have access to. Its elements are mostly file system type functions such as cd() ls(), CopyFile() and GetFileInfo() and are accessed via space$function_name(). To see a listing of files in a folder inside a bucket we call space$ls() with the bucket and folder name as the argument, in quotes.

space$ls('bucket_name/folder_name') ## [1] "bucket_name/folder_name/2021-01-01.csv" ## [2] "bucket_name/folder_name/2021-01-02.csv" ## [3] "bucket_name/folder_name/2021-01-03.csv" ## [4] "bucket_name/folder_name/2021-01-04.csv" ## [5] "bucket_name/folder_name/2021-01-05.csv" ## [6] "bucket_name/folder_name/2021-01-06.csv" ## [7] "bucket_name/folder_name/2021-01-07.csv" ## [8] "bucket_name/folder_name/2021-01-08.csv" ## [9] "bucket_name/folder_name/2021-01-09.csv" ## [10] "bucket_name/folder_name/2021-01-10.csv" ## [11] "bucket_name/folder_name/2021-01-11.csv" ## [12] "bucket_name/folder_name/2021-01-12.csv" ## [13] "bucket_name/folder_name/2021-01-13.csv" ## [14] "bucket_name/folder_name/2021-01-14.csv" ## [15] "bucket_name/folder_name/2021-01-15.csv" ## [16] "bucket_name/folder_name/2021-01-16.csv" ## [17] "bucket_name/folder_name/2021-01-17.csv" ## [18] "bucket_name/folder_name/2021-01-18.csv" ## [19] "bucket_name/folder_name/2021-01-19.csv" ## [20] "bucket_name/folder_name/2021-01-20.csv" ## [21] "bucket_name/folder_name/2021-01-21.csv" ## [22] "bucket_name/folder_name/2021-01-22.csv" ## [23] "bucket_name/folder_name/2021-01-23.csv" ## [24] "bucket_name/folder_name/2021-01-24.csv" ## [25] "bucket_name/folder_name/2021-01-25.csv" ## [26] "bucket_name/folder_name/2021-01-26.csv" ## [27] "bucket_name/folder_name/2021-01-27.csv" ## [28] "bucket_name/folder_name/2021-01-28.csv" ## [29] "bucket_name/folder_name/2021-01-29.csv" ## [30] "bucket_name/folder_name/2021-01-30.csv" ## [31] "bucket_name/folder_name/2021-01-31.csv" ## [32] "bucket_name/folder_name/2021-02-01.csv" ## [33] "bucket_name/folder_name/2021-02-02.csv" ## [34] "bucket_name/folder_name/2021-02-03.csv" ## [35] "bucket_name/folder_name/2021-02-04.csv" ## [36] "bucket_name/folder_name/2021-02-05.csv" ## [37] "bucket_name/folder_name/2021-02-06.csv" ## [38] "bucket_name/folder_name/2021-02-07.csv" ## [39] "bucket_name/folder_name/2021-02-08.csv" ## [40] "bucket_name/folder_name/2021-02-09.csv" ## [41] "bucket_name/folder_name/2021-02-10.csv" ## [42] "bucket_name/folder_name/2021-02-11.csv" ## [43] "bucket_name/folder_name/2021-02-12.csv" ## [44] "bucket_name/folder_name/2021-02-13.csv" ## [45] "bucket_name/folder_name/2021-02-14.csv" ## [46] "bucket_name/folder_name/2021-02-15.csv" ## [47] "bucket_name/folder_name/2021-02-16.csv" ## [48] "bucket_name/folder_name/2021-02-17.csv" ## [49] "bucket_name/folder_name/2021-02-18.csv" ## [50] "bucket_name/folder_name/2021-02-19.csv" ## [51] "bucket_name/folder_name/2021-02-20.csv" ## [52] "bucket_name/folder_name/2021-02-21.csv" ## [53] "bucket_name/folder_name/2021-02-22.csv" ## [54] "bucket_name/folder_name/2021-02-23.csv" To use open_dataset() we need a path to the folder holding the files. This is built with space$path(). The structure is "bucket_name/folder_name". For security, even the bucket and folder names are stored in environment variables. For this example, the real values have been substituted with "bucket_name" and "folder_name".

files_path <- space$path(glue::glue("{Sys.getenv('DO_SPACE')}/{Sys.getenv('DO_FOLDER')}")) files_path ## SubTreeFileSystem: s3://bucket_name/folder_name/ Now we can call open_dataset() to get access to all the files in that folder. We need to specify format="csv" to let the function know the files are CSVs as opposed to parquet or other formats. files_path <- space$path(glue::glue("{Sys.getenv('DO_SPACE')}/{Sys.getenv('DO_FOLDER')}"))

temps_dataset <- arrow::open_dataset(files_path, format='csv')
## FileSystemDataset with 66 csv files
## Name: string
## Sensor: string
## date: date32[day]
## time: string
## temperature: double
## occupancy: int64
## Thermostat: int64
## zoneAveTemp: double
## HVACmode: string
## fan: int64
## outdoorTemp: double
## outdoorHumidity: int64
## sky: int64
## wind: int64
## zoneClimate: string
## zoneCoolTemp: double
## zoneHeatTemp: double
## zoneHVACmode: string
## zoneOccupancy: int64

Printing out the temps_dataset object shows the column names along with the type of data stored in each column. {arrow} is pretty great and lets us do column selection and row filtering on the data sitting in files which opens up a whole world of data analysis on data too large to fit in memory. We are simply going to select columns of interest and collect all the data into one data.frame (actually a tibble). Notice we call select() before collect() because this reduces the number of columns being transmitted over the network.

library(dplyr)
temps_raw %>% head()
## # A tibble: 6 x 16
##   Name  Sensor date       time  temperature   fan zoneAveTemp HVACmode
##   <chr> <chr>  <date>     <chr>       <dbl> <int>       <dbl> <chr>
## 1 Upst… Bedro… 2021-01-01 00:0…        67.8   300          70 heat
## 2 Upst… Bedro… 2021-01-01 00:0…        72.4   300          70 heat
## 3 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat
## 4 Upst… Upsta… 2021-01-01 00:0…        63.1   300          70 heat
## 5 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat
## 6 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat
## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>,
## #   wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>,
## #   zoneHVACmode <chr>

# Preparing the Data

The data are in need of some minor attention. First, the time column is read in as a character instead of time. This is a known issue, so instead we combine it with the date column using tidyr::unite() then convert this combination into a datetime (or timestamp) with lubridate::as_datetime(), which requires us to set a time zone. Then, the sensors named Upstairs Thermostat and Downstairs Thermostat actually represent Bedroom 3 and Dining Room, respectively, so we rename those using case_when() from {dplyr}, leaving all the other sensor names as is. Then we only keep rows where the time is earlier than the present. This is due to a quirk of the Ecobee API where it can return future readings, which happens because we request all of the latest day’s data.

temps <- temps_raw %>%
tidyr::unite(col='time', date, time, sep=' ', remove=TRUE) %>%
mutate(time=lubridate::as_datetime(time, tz=Sys.timezone())) %>%
mutate(
Sensor=case_when(
Sensor == 'Upstairs Thermostat' ~ 'Bedroom 3',
Sensor == 'Downstairs Thermostat' ~ 'Dining Room',
TRUE ~ Sensor
)
) %>%
filter(time <= lubridate::now())

## Skip All of This

Since the actual data may contain sensitive information a sanitized version is stored as parquet files on a public DigitalOcean space. This dataset is not as up to date but it will do for those that want to follow along.

publicspace <- arrow::S3FileSystem$create( # anonymous means we do not need to provide credentials # is crucial, otherwise the function looks for credentials anonymous=TRUE, # and crashes R if it can't find them scheme="https", # the data are stored in the nyc3 region endpoint_override='nyc3.digitaloceanspaces.com' ) publicspace$ls('temperaturedata', recursive=TRUE)
## [1] "temperaturedata/2021"
## [2] "temperaturedata/2021/1"
## [3] "temperaturedata/2021/1/part-0.parquet"
## [4] "temperaturedata/2021/2"
## [5] "temperaturedata/2021/2/part-1.parquet"
publictemp <- arrow::open_dataset(publicspace$path("temperaturedata")) %>% collect() publictemp %>% head() ## # A tibble: 6 x 15 ## Name Sensor time temperature fan zoneAveTemp HVACmode ## <chr> <chr> <dttm> <dbl> <int> <dbl> <chr> ## 1 Upst… Bedro… 2021-01-01 00:00:00 67.8 300 70 heat ## 2 Upst… Bedro… 2021-01-01 00:00:00 72.4 300 70 heat ## 3 Upst… Offic… 2021-01-01 00:00:00 NA 300 70 heat ## 4 Upst… Bedro… 2021-01-01 00:00:00 63.1 300 70 heat ## 5 Upst… Offic… 2021-01-01 00:00:00 NA 300 70 heat ## 6 Upst… Offic… 2021-01-01 00:00:00 NA 300 70 heat ## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>, ## # wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>, ## # zoneHVACmode <chr> # Visualizing with Interactive Plots The first step in any good analysis is visualizing the data. In the past, I would have recommended {dygraphs} but have since come around to {feasts} from Rob Hyndman’s team and {timetk} from Matt Dancho due to their ease of use, better flexibility with data types and because I personally think they are more visually pleasing. In order to show the outdoor temperature on the same footing as the sensor temperatures, we select the pertinent columns with the outdoor data from the temps tibble and stack them at the bottom of temps, after accounting for duplicates (Each sensor has an outdoor reading, which is the same across sensors). all_temps <- dplyr::bind_rows( temps, temps %>% dplyr::select(Name, Sensor, time, temperature=outdoorTemp) %>% dplyr::mutate(Sensor='Outdoors', Name='Outdoors') %>% dplyr::distinct() ) Then we use plot_time_series() from {timetk} to make a plot that has a different colored line for each sensor and for the outdoors (with sensors being grouped into facets according to which thermostat they are associated with). Ordinarily the interactive version would be better (.interactive=TRUE below), but for the sake of this post static displays better. library(timetk) all_temps %>% group_by(Name) %>% plot_time_series( .date_var=time, .value=temperature, .color_var=Sensor, .smooth=FALSE, .interactive=FALSE ) ## Warning: Removed 34776 row(s) containing missing values (geom_path). From this we can see a few interesting patterns. Bedroom 1 is consistently higher than the other bedrooms, but lower than the offices which are all on the same thermostat. All the downstairs rooms see their temperatures drop overnight when the thermostat is lowered to conserve energy. Between February 12 and 18 this dip was not as dramatic, due to running an experiment to see how raising the temperature on the downstairs thermostat would affect rooms on the second floor. # Computing Cross Correlations One room in particular, Bedroom 2, always seemed colder than all the rest of the rooms, so I wanted to figure out why. Was it affected by physically adjacent rooms? Or maybe it was controlled by the downstairs thermostat rather than the upstairs thermostat as presumed. So the first step was to see which rooms were correlated with each other. But computing correlation with time series data isn’t as simple because we need to account for lagged correlations. This is done with the cross-correlation function (ccf). In order to compute the ccf, we need to get the data into a time series format. While R has extensive formats, most notably the built-in ts and its extension xts by Jeff Ryan, we are going to use tsibble, from Rob Hyndman and team, which are tibbles with a time component. The tsibble object can treat multiple time series as individual data and this requires the data to be in long format. For this analysis, we want to treat multiple time series as interconnected, so we need to put the data into wide format using pivot_wider(). wide_temps <- all_temps %>% # we only need these columns select(Sensor, time, temperature) %>% # make each time series its own column tidyr::pivot_wider( id_cols=time, names_from=Sensor, values_from=temperature ) %>% # convert into a tsibble tsibble::as_tsibble(index=time) %>% # fill down any NAs tidyr::fill() wide_temps ## # A tibble: 15,840 x 12 ## time Bedroom 2 Bedroom 1 Office 1 Bedroom 3 Office 3 ## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2021-01-01 00:00:00 67.8 72.4 NA 63.1 NA ## 2 2021-01-01 00:05:00 67.8 72.8 NA 62.7 NA ## 3 2021-01-01 00:10:00 67.9 73.3 NA 62.2 NA ## 4 2021-01-01 00:15:00 67.8 73.6 NA 62.2 NA ## 5 2021-01-01 00:20:00 67.7 73.2 NA 62.3 NA ## 6 2021-01-01 00:25:00 67.6 72.7 NA 62.5 NA ## 7 2021-01-01 00:30:00 67.6 72.3 NA 62.7 NA ## 8 2021-01-01 00:35:00 67.6 72.5 NA 62.5 NA ## 9 2021-01-01 00:40:00 67.6 72.7 NA 62.4 NA ## 10 2021-01-01 00:45:00 67.6 72.7 NA 61.9 NA ## # … with 15,830 more rows, and 6 more variables: Office 2 <dbl>, Living ## # Room <dbl>, Playroom <dbl>, Kitchen <dbl>, Dining Room <dbl>, ## # Outdoors <dbl> Some rooms’ sensors were added after the others so there are a number of NAs at the beginning of the tracked time. We compute the ccf using CCF() from the {feasts} package then generate a plot by piping that result into autoplot(). Besides a tsibble, CCF() needs two arguments, the columns whose cross-correlation is being computed. library(feasts) wide_temps %>% CCF(Living Room, Bedroom 2) %>% autoplot() The negative lags are when the first column is a leading indicator of the second column and the positive lags are when the first column is a lagging indicator of the second column. In this case, Living Room seems to follow Bedroom 2 rather than lead. We want to compute the ccf between Bedroom 2 and every other room. We call CCF() on each column in wide_temps against Bedroom 2, including Bedroom 2 itself, by iterating over the columns of wide_temps with map(). We use a ~ to treat CCF() as a quoted function and wrap .x in double curly braces to take advantage of non-standard evaluation. This creates a list of objects that can be plotted with autoplot(). We make a list of ggplot objects by iterating over the ccf objects using imap(). This allows us to use the name of each element of the list in ggtitle. Lastly, wrap_plots() from {patchwork} is used to show all the plots in one display. library(patchwork) library(ggplot2) room_ccf <- wide_temps %>% purrr::map(~CCF(wide_temps, {{.x}}, Bedroom 2)) # we don't want the CCF with time, outdoor temp or the Bedroom 2 itself room_ccf$time <- NULL
room_ccf$Outdoors <- NULL room_ccf$Bedroom 2 <- NULL

biggest_cor <- room_ccf %>% purrr::map_dbl(~max(.x$ccf)) %>% max() room_ccf_plots <- purrr::imap( room_ccf, ~ autoplot(.x) + # show the maximally cross-correlation value # so it easier to compare rooms geom_hline(yintercept=biggest_cor, color='red', linetype=2) + ggtitle(.y) + labs(x=NULL, y=NULL) + ggthemes::theme_few() ) wrap_plots(room_ccf_plots, ncol=3) This brings up an odd result. Office 1, Office 2, Office 3, Bedroom 1, Bedroom 2 and Bedroom 3 are all controlled by the same thermostat, but Bedroom 2 is only really cross-correlated with Bedroom 3 (and negatively with Office 3). Instead, Bedroom 2 is most correlated with Playroom and Living Room, both of which are controlled by a different thermostat. So it will be interesting to see which thermostat setting (the set desired temperature) is most cross-correlated with Bedroom 2. At the same time we’ll check how much the outdoor temperature cross-correlates with Bedroom 2 as well. temp_settings <- all_temps %>% # there is a column for outdoor temperatures so we don't need these rows filter(Name != 'Outdoors') %>% select(Name, time, HVACmode, zoneCoolTemp, zoneHeatTemp) %>% # this information is repeated for each sensor, so we only need it once distinct() %>% # depending on the setting, the set temperature is in two columns mutate(setTemp=if_else(HVACmode == 'heat', zoneHeatTemp, zoneCoolTemp)) %>% tidyr::pivot_wider(id_cols=time, names_from=Name, values_from=setTemp) %>% # get the reading data right_join(wide_temps %>% select(time, Bedroom 2, Outdoors), by='time') %>% as_tsibble(index=time) temp_settings ## # A tibble: 15,840 x 5 ## time Upstairs Downstairs Bedroom 2 Outdoors ## <dttm> <dbl> <dbl> <dbl> <dbl> ## 1 2021-01-01 00:00:00 71 72 67.8 31.3 ## 2 2021-01-01 00:05:00 71 72 67.8 31.3 ## 3 2021-01-01 00:10:00 71 72 67.9 31.3 ## 4 2021-01-01 00:15:00 71 66 67.8 31.3 ## 5 2021-01-01 00:20:00 71 66 67.7 31.3 ## 6 2021-01-01 00:25:00 71 66 67.6 31.3 ## 7 2021-01-01 00:30:00 71 66 67.6 31.2 ## 8 2021-01-01 00:35:00 71 66 67.6 31.2 ## 9 2021-01-01 00:40:00 71 66 67.6 31.2 ## 10 2021-01-01 00:45:00 71 66 67.6 31.2 ## # … with 15,830 more rows Making this tsibble longer allows us to use the built-in grouping and iteration that comes with CCF() and other functions in {fable}. temp_settings_long <- temp_settings %>% tidyr::pivot_longer( cols=-c(time, Bedroom 2), names_to='Thermostat', values_to='Temperature' ) room_thermostat_ccf <- temp_settings_long %>% CCF(Temperature, Bedroom 2) room_thermostat_ccf %>% autoplot() + ggthemes::theme_few() This plot makes a strong argument for Bedroom 2 being more affected by the downstairs thermostat than the upstairs like we presume. But I don’t think the downstairs thermostat is actually controlling the vents in Bedroom 2 because I have looked at the duct work and followed the path (along with a professional HVAC technician) from the furnace to the vent in Bedroom 2. What I think is more likely is that the rooms downstairs get so cold (I lower the temperature overnight to conserve energy), and there is not much insulation between the floors, so the vent in Bedroom 2 can’t pump enough warm air to compensate. I did try to experiment for a few nights (February 12 and 18) by not lowering the downstairs temperature but the eyeball test didn’t reveal any impact. Perhaps a proper A/B test is called for. # Fitting Time Series Models with {fable} Going beyond cross-correlations, an ARIMA model with exogenous variables can give an indication if input variables have a significant impact on a time series while accounting for autocorrelation and seasonality. We want to model the temperature in Bedroom 2 while using all the other rooms, the outside temperature and the thermostat settings to see which affected Bedroom 2 the most. We fit a number of different models then compare them using AICc (AIC corrected) to see which fits the best. To do this we need a tsibble with each of these variables as a column. This requires us to join temp_settings with wide_temps. all_ts <- wide_temps %>% left_join( temp_settings %>% select(time, Upstairs, Downstairs), by='time' ) Now we use model() from {fable} (technically from {fabletools}) to fit a number of models. We make it run in parallel with plan() from {future}. library(future) plan(multisession, workers=10) library(tictoc) library(fable) tic() ts_mods <- all_ts %>% model( mod0=ARIMA(Bedroom 2) , mod1=ARIMA(Bedroom 2 ~ Bedroom 1) , mod2=ARIMA(Bedroom 2 ~ Bedroom 3) , mod3=ARIMA(Bedroom 2 ~ Bedroom 1 + Bedroom 3) , mod4=ARIMA(Bedroom 2 ~ Living Room) , mod5=ARIMA(Bedroom 2 ~ Playroom) , mod6=ARIMA(Bedroom 2 ~ Living Room + Playroom) , mod7=ARIMA(Bedroom 2 ~ Living Room + Playroom + Kitchen + Dining Room ) , mod8=ARIMA(Bedroom 2 ~ Bedroom 1 + Bedroom 3 + Living Room + Playroom + Kitchen + Dining Room ) , mod9=ARIMA(Bedroom 2 ~ Outdoors) , mod10=ARIMA(Bedroom 2 ~ Upstairs) , mod11=ARIMA(Bedroom 2 ~ Downstairs) , mod12=ARIMA(Bedroom 2 ~ Downstairs + Upstairs) , mod13=ARIMA(Bedroom 2 ~ Outdoors + Downstairs + Upstairs) , mod14=ARIMA(Bedroom 2 ~ Outdoors + Downstairs + Upstairs + Living Room + Playroom + Kitchen + Dining Room ) , mod15=ARIMA(Bedroom 2 ~ Outdoors + Downstairs + Upstairs + Bedroom 1 + Bedroom 3 ) , mod16=ARIMA(Bedroom 2 ~ Outdoors + Downstairs + Upstairs + Bedroom 1 + Bedroom 3 + Living Room + Playroom + Kitchen + Dining Room ) ) toc() ## 99.41 sec elapsed This results in a tibble of models (termed a mable) with one column per model. We use glance() to see the AICc for each, along with other information. ts_mods %>% glance() %>% arrange(AICc) ## # A tibble: 17 x 8 ## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> ## 1 mod3 0.0247 6723. -13427. -13427. -13358. <cpl [3]> <cpl [14]> ## 2 mod1 0.0247 6714. -13411. -13411. -13350. <cpl [3]> <cpl [14]> ## 3 mod15 0.0247 6709. -13394. -13394. -13302. <cpl [3]> <cpl [3]> ## 4 mod4 0.0257 6393. -12768. -12768. -12699. <cpl [3]> <cpl [25]> ## 5 mod2 0.0264 6199. -12381. -12381. -12320. <cpl [2]> <cpl [15]> ## 6 mod9 0.0264 6190. -12365. -12365. -12304. <cpl [2]> <cpl [26]> ## 7 mod10 0.0264 6190. -12365. -12365. -12303. <cpl [2]> <cpl [26]> ## 8 mod0 0.0264 6189. -12365. -12365. -12311. <cpl [2]> <cpl [26]> ## 9 mod13 0.0264 6176. -12331. -12331. -12255. <cpl [2]> <cpl [26]> ## 10 mod12 0.0264 6175. -12331. -12331. -12262. <cpl [2]> <cpl [26]> ## 11 mod11 0.0264 6174. -12331. -12331. -12270. <cpl [2]> <cpl [26]> ## 12 mod6 0.0195 4940. -9863. -9863. -9802. <cpl [5]> <cpl [0]> ## 13 mod5 0.0195 4932. -9849. -9849. -9796. <cpl [5]> <cpl [0]> ## 14 mod8 0.0163 4603. -9184. -9184. -9100. <cpl [4]> <cpl [0]> ## 15 mod16 0.0163 4605. -9181. -9181. -9074. <cpl [4]> <cpl [0]> ## 16 mod7 0.0171 4364. -8708. -8708. -8631. <cpl [5]> <cpl [0]> ## 17 mod14 0.0171 4365. -8704. -8704. -8605. <cpl [5]> <cpl [0]> ts_mods %>% glance() %>% # sort .model according to AICc for better plotting mutate(.model=forcats::fct_reorder(.model, .x=AICc, .fun=sort)) %>% # smaller is better for AICc # so negate it so that the tallest bar is best ggplot(aes(x=.model, y=-AICc)) + geom_col() + theme(axis.text.x=element_text(angle=25)) The best two models (as judged by AICc) are mod3 and mod1, some of the simpler models. The key thing they have in common is that Bedroom 1 is a predictor, which somewhat contradicts the findings from the cross-correlation function. We can examine mod3 by selecting it out of ts_mods then calling report(). ts_mods %>% select(mod3) %>% report() ## Series: Bedroom 2 ## Model: LM w/ ARIMA(3,1,2)(0,0,1)[12] errors ## ## Coefficients: ## ar1 ar2 ar3 ma1 ma2 sma1 Bedroom 1 ## 0.4497 -0.5184 0.0446 0.0650 0.3160 -0.0049 0.2723 ## s.e. 0.0593 0.0471 0.0265 0.0589 0.0274 0.0082 0.0082 ## Bedroom 3 ## 0.0396 ## s.e. 0.0092 ## ## sigma^2 estimated as 0.02469: log likelihood=6722.64 ## AIC=-13427.28 AICc=-13427.27 BIC=-13358.25 This shows that Bedroom 1 and Bedroom 3 are significant predictors with SARIMA(3,1,2)(0,0,1) errors. The third best model is one of the more complicated models, mod15. Not all the predictors are significant, though we should not judge a model, or the predictors, based on that. ts_mods %>% select(mod15) %>% report() ## Series: Bedroom 2 ## Model: LM w/ ARIMA(3,1,3) errors ## ## Coefficients: ## ar1 ar2 ar3 ma1 ma2 ma3 Outdoors Downstairs ## 0.1870 -0.4426 -0.0724 0.3288 0.3745 0.0886 -0.0037 -0.0015 ## s.e. 0.2351 0.0928 0.1058 0.2352 0.0490 0.0692 0.0026 0.0025 ## Upstairs Bedroom 1 Bedroom 3 ## 0.0087 0.2718 0.0387 ## s.e. 0.0066 0.0081 0.0092 ## ## sigma^2 estimated as 0.02468: log likelihood=6709.22 ## AIC=-13394.43 AICc=-13394.41 BIC=-13302.39 We can visualize this model with {coefplot} after stripping out the time series coefficients. library(coefplot) ts_mods %>% select(mod15) %>% coef() %>% filter(!stringr::str_detect(term, '^ar|ma\\d+$')) %>%
select(Coefficient=term, Value=estimate, SE=std.error, Model=.model) %>%
mutate(
LowOuter=Value - 2*SE, LowInner=Value - 1*SE,
HighInner=Value + 1*SE, HighOuter=Value + 2*SE
) %>%
arrange(Value) %>%
mutate(Coefficient=factor(Coefficient, levels=Coefficient)) %>%
coefplot()

What all these top performing models have in common is the inclusion of the other bedrooms as predictors.

While helping me interpret the data, my wife pointed out that all the colder rooms are on one side of the house while the warmer rooms are on the other side. That colder side is the one that gets hit by winds which are rather strong. This made me remember a long conversation I had with a very friendly and knowledgeable sales person at GasTec who said that wind can significantly impact a house’s heat due to it blowing away the thermal envelope. I wanted to test this idea, but while the Ecobee API returns data on wind, the value is always 0, so it doesn’t tell me anything. Hopefully, I can get that fixed.

# Running Everything with {targets}

There were many moving parts to the analysis, so I managed it all with {targets}. Each step seen in this post is a separate target in the workflow. There is even a target to build an html report from an R Markdown file. The key is to load objects created by the workflow into the R Markdown document with tar_read() or tar_load(). Then tar_render() will figure out how the R Markdown document figures into the computational graph.

tar_render(
results,
'Analysis.Rmd'
)

I went over {targets} more thoroughly in the previous post and I recommend Will Landau’s talk at the New York Open Statistical Programming Meetup to learn more.

# Reporting with RStudio Connect

In addition to the R Markdown I made a shiny dashboard using the R Markdown flexdashboard format. I host this on RStudio Connect so I can get a live view of the data and examine the models. Besides from being a partner with RStudio, I really love this product and I use it for all of my internal reporting (and side projects) and we have a whole practice around getting companies up to speed with the tool.

# What Comes Next?

So what can be done with this information? Based on the ARIMA model the other bedrooms on the same floor are predictive of Bedroom 2, which makes sense since they are controlled by the same thermostat. Bedroom 3 is consistently one of the coldest in the house and shares a wall with Bedroom 2, so maybe that wall could use some insulation. Bedroom 1 on the other hand is the hottest room on that floor, so my idea is to find a way to cool off that room (we sealed the vents with magnetic covers), thus lowering the average temperature on the floor, causing the heat to kick in, raising the temperature everywhere, including Bedroom 2.

The cross-correlation functions also suggest a relationship with Playroom and Living Room, which are some of the colder rooms downstairs and the former is directly underneath Bedroom 2. I’ve already put caulking cord in the gaps around the floorboards in the Playroom and that has noticeably raised the temperature in there, so hopefully that will impact Bedroom 2. We already have storm windows on just about every window and have added thermal curtains to Bedroom 2, Bedroom 3 and the Living Room. My wife and I also discovered air billowing in from around, not in, the fireplace in the Living Room, so if we can close that source of cold air, that room should get warmer then hopefully Bedroom 2 as well. Lastly, we want to insulate between the first floor ceiling and second floor.

While we make these fixes, we’ll keep collecting data and trying new ways to analyze it all. It will be interesting to see how the data change in the summer.

These three posts were intended to show the entire process from data collection to automation (what some people are calling robots) to analysis, and show how data can be used to make actionable decisions. This is one more way I’m improving my life with data.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

In a recent post I talked about collecting temperature data from different rooms in my house. Using {targets}, I am able to get temperature readings for any given day down to five-minute increments and store that data on a DigitalOcean space.

Since I want to be able to both fit models on the data and track temperatures close to real time I needed this to run on a regular basis, say every 30-60 minutes. I have a machine learning server that Kaz Sakamoto, Michael Beigelmacher and I built from parts we bought and a collection of Raspberry Pis, but I didn’t want to maintain the hardware for this, I wanted something that was handled for me.

Luckily GitHub Actions can run arbitrary code on a schedule and provides free compute for public repositories. So I built a Docker image and GitHub Actions configuration files to make it all happen. This all is based on the templogging repository on GitHub.

# Docker Image

To ensure everything works as expected and to prevent package updates from breaking code, I used a combination of {renv} and Docker.

{renv} allows each individual project to have its own set of packages isolated from all other projects’ packages. It also keeps track of package versions so that the project can be restored on other computers, or in a Docker container, in the exact same configuration.

Using {renv} is fairly automatic. When starting a new project run renv::init() to create the isolated package library and start tracking packages. Then packages are installed as usual with install.packages() or renv::install(). Then renv::snapshot() is periodically used to write the specific package versions to the renv.lock file which is saved in the root directory of the project.

Docker is like a lightweight virtual machine that can create a specific environment. For this project I wanted a specific version of R (4.0.3), a handful of system libraries like libxml2-dev (for {igraph}, which is needed for {targets}) and libcurl4-openssl-dev (for {httr}) and all of the packages tracked with {renv}.

The R version is provided by starting with the corresponding r-ver image from the rocker project. The system libraries are installed using apt install.

The R packages are installed using renv::restore() but this requires files from the project to be added to the image in the right order. First, the templogging directory is created to hold everything, then individual files needed for the project and {renv} are added to that directory: templogging.Rproj, .Rprofile and renv.lock. Then the entire renv directory is added as well. After that, the working directory is changed to the templogging directory so that the following commands take place in that location.

Then comes time to install all the packages tracked in renv.lock. This is done with RUN Rscript -e "renv::restore(repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest', confirm=FALSE)". Rscript -e runs R code as if it were executed in an R session. renv::restore() installs all the specific packages. Setting repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest' specifies to use prebuilt binaries from the public RStudio Package Manager (some people need the commercial, private version and if you think this may be you, my company, Lander Analytics can help you get started).

After the packages are installed, the rest of the files in the project are copied into the image with ADD . /templogging. It is important that this step takes place after the packages are installed so that changes to code does not trigger a reinstall of the packages.

This is the complete Dockerfile.

FROM rocker/r-ver:4.0.3

# system libraries
RUN apt update && \
apt install  -y --no-install-recommends \
# for igraph
libxml2-dev \
libglpk-dev \
libgmp3-dev \
# for httr
libcurl4-openssl-dev \
libssl-dev && \
# makes the image smaller
rm -rf /var/lib/apt/lists/*

# create project directory
RUN mkdir templogging
# add all of the renv folder, except the library which is marked out in .dockerignore

# make sure we are in the project directory so restoring the packages works cleanly
WORKDIR /templogging

# this restores the desired packages (including specific versions)
# it uses the public RStudio package manager to get binary versions of the packages
# this is for faster installation
RUN Rscript -e "renv::restore(repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest', confirm=FALSE)"

# then we add in the rest of the project folder, including  all the code
# we do this separately so that we can change code without having to reinstall all the packages
ADD . /templogging

To make building and running the image easier I made a docker-compose.yml file.

version: '3.4'

services:
templogging:
image: jaredlander/templogging
container_name: tempcheck
build:
context: .
dockerfile: ./Dockerfile
environment:
- BUCKET_NAME=${BUCKET_NAME} - FOLDER_NAME=${FOLDER_NAME}
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION} - AWS_S3_ENDPOINT=${AWS_S3_ENDPOINT}
- AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN} - TZ=${TZ}

So we can now build and upload the image with the following shell code.

docker-compose build
docker push jaredlander/templogging:latest

Now the image is available to be used anywhere. Since this image, and all the code are public, it’s important that nothing personal or private was saved in the code. As can be seen in the docker-compose.yml file or any of the R code in the repo, all potentially sensitive information was stored in environment variables.

The {targets} workflow can be executed inside the Docker container with the following shell command.

docker-compose run templogging R -e "targets::tar_make()"

That will get and write data for the current date. To run this for a specific date, a variable called process_date should be set to that date (assumes the TZ environment variable has been set) and tar_make() should be called with the callr_function argument set to NULL. I’m sure there is a better way to set arguments to make tar_make() of runtime settings, but I haven’t figured it out yet.

docker-compose run templogging R -e "process_date <- as.Date('2021-02-20')" -e "targets::tar_make(callr_function=NULL)"

# GitHub Actions

After the Docker image was built, I automated the whole process using GitHub Actions. This requires a yml file inside the .github/workflows folder in the repository.

The first section is on:. This tells GitHub to run this job whenever there is a push to the main or master branch and whenever there is a pull request to these branches. It also runs every 30 minutes thanks to the schedule block that has - cron: '*/30 * * * *'. The */30 means “every thirty” and it’s the first position which means minutes (0-59). The second position would be hours (0-23). The following positions are day of month (1-31), month of year (1-12), day of the week (1-7, with 1 as Monday) and year (1900-3000).

The next section, name:, just provides the name of the workflow.

Then comes the jobs: section. Multiple jobs can be run, but there is only one job called Run-Temp-Logging.

The job runs on (runs-on:) ubuntu-latest as opposed to Windows or macOS.

This is going to use the Docker container so the container: block specifies an image: and environment variables (env:). The environment variables are stored securely in GitHub and are accessed via ${{ secrets.VARIABLE_NAME }}. Since I created this I have been debating if it would be faster to install the packages directly in the virtual machine spun up by the actions runner than to use a Docker image. Given that R package installations can be cached it might work out, but I haven’t tested it yet. The steps: block runs the {targets} workflow. There can be multiple steps, though in this case there is only one, so each starts with a dash (-). For the step, a name: is given, along with what to run (run: R -e "targets::tar_make()") and the working directory (working-directory: /templogging), each on a separate line in the file. The R -e "targets::tar_make()" command runs in the Docker container with /templogging as the working directory. This has the same result as running docker-compose run templogging R -e "targets::tar_make()" locally. Each time this is run the file in the DigitalOcean space gets written anew. The entire yml file is below. on: push: branches: - main - master pull_request: branches: - main - master schedule: - cron: '*/30 * * * *' name: Run-Targets jobs: Run-Temp-Logging: runs-on: ubuntu-latest container: image: jaredlander/templogging:latest env: BUCKET_NAME:${{ secrets.BUCKET_NAME }}
FOLDER_NAME: ${{ secrets.FOLDER_NAME }} AWS_ACCESS_KEY_ID:${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} AWS_DEFAULT_REGION:${{ secrets.AWS_DEFAULT_REGION }}
AWS_S3_ENDPOINT: ${{ secrets.AWS_S3_ENDPOINT }} AWS_SESSION_TOKEN: "" ECOBEE_API_KEY:${{ secrets.ECOBEE_API_KEY }}
ECOBEE_REFRESH_TOKEN: ${{ secrets.ECOBEE_REFRESH_TOKEN }} steps: - name: Run targets workflow run: R -e "targets::tar_make()" working-directory: /templogging With this running every 30 minutes I can pull the data just about any time and have close-enough-to-live insight into my house. But let’s be honest, I really only need to see up to the past few days to make sure everything is all right, so a 30-minute delay is more than good enough. # End of Day While the job is scheduled to run every thirty minutes, I noticed that it wasn’t exact, so I couldn’t be sure I would capture the last few five-minute intervals of each day. So I made another GitHub Action to pull the prior day’s data at 1:10 AM. The instructions are mostly the same with just two changes. First, the schedule is - cron: '10 1 * * *' which means on the 10th minute of the first hour. Since the rest of the slots are * the job occurs on every one of them, so it is the 10th minute of the first hour of every day of every month of every year. The second change is the command given to the Docker container to run the {targets} workflow. As alluded to earlier, the _targets.R file is designed so that if Process_date is assigned a date, the data will be pulled for that date instead of the current date. So the run: portion has the command R -e "Sys.setenv(TZ='${{ secrets.TZ }}')" -e "process_date <- Sys.Date() - 1" -e "targets::tar_make(callr_function=NULL)". This makes sure the timezone environment variable is set in R, process_date gets the value of the prior day’s date and tar_make() is called.

The entire yml file is below.

on:
push:
branches:
- main
- master
pull_request:
branches:
- main
- master
schedule:
- cron: '10 1 * * *'

name: Run-Targets

jobs:
Run-Temp-Logging:
runs-on: ubuntu-latest
container:
image: jaredlander/templogging:latest
env:
BUCKET_NAME: ${{ secrets.BUCKET_NAME }} FOLDER_NAME:${{ secrets.FOLDER_NAME }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} AWS_SECRET_ACCESS_KEY:${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }} AWS_S3_ENDPOINT:${{ secrets.AWS_S3_ENDPOINT }}
AWS_SESSION_TOKEN: ""
ECOBEE_API_KEY: ${{ secrets.ECOBEE_API_KEY }} ECOBEE_REFRESH_TOKEN:${{ secrets.ECOBEE_REFRESH_TOKEN }}
steps:
- name: Run targets workflow
run: R -e "Sys.setenv(TZ='${{ secrets.TZ }}')" -e "process_date <- Sys.Date() - 1" -e "targets::tar_make(callr_function=NULL)" working-directory: /templogging # What’s Next? Now that the data are being collected it’s time to analyze and see what is happening in the house. This is very involved so that will be the subject of the next post. Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone. ### The costs involved with a health insurance plan can be confusing so I perform an analysis of different options to find which plan is most cost effective My wife and I recently brought a new R programmer into our family so we had to update our health insurance. Becky is a researcher in neuroscience and psychology at NYU so we decided to choose an NYU insurance plan. For families there are two main plans: Value and Advantage. The primary differences between the plans are the following: Item Explanation Value Plan Amount Advantage Plan Amount Bi-Weekly Premiums The amount we pay every other week in order to have insurance$160 ($4,160 annually)$240 ($6,240 annually) Deductible Amount we pay directly to health providers before the insurance starts covering costs$1,000 $800 Coinsurance After the deductible is met, we pay this percentage of medical bills 20% 10% Out-of-Pocket Maximum This is the most we will have to pay to health providers in a year (premiums do not count toward this max)$6,000 $5,000 We put them into a tibble for use later. # use tribble() to make a quick and dirty tibble parameters <- tibble::tribble( ~Plan, ~Premiums, ~Deductible, ~Coinsurance, ~OOP_Maximum, 'Value', 160*26, 1000, 0.2, 6000, 'Advantage', 240*26, 800, 0.1, 5000 ) Other than these cost differences, there is not any particular benefit of either plan over the other. That means whichever plan is cheaper is the best to choose. This blog post walks through the steps of evaluating the plans to figure out which to select. Code is included so anyone can repeat, and improve on, the analysis for their given situation. # Cost In order to figure out which plan to select we need to figure out the all-in cost, which is a function of how much we spend on healthcare in a year (we have to estimate our annual spending) and the aforementioned premiums, deductible, coinsurance and out-of-pocket maximum. $\text{cost} = f(\text{spend}; \text{premiums}, \text{deductible}, \text{coinsurance}, \text{oop_maximum}) = \\ \text{min}(\text{oop_maximum}, \text{deductible} + \text{coinsurance}*(\text{spend}-\text{deductible}))+\text{premiums}$ This can be written as an R function like this. #' @title cost #' @description Given healthcare spend and other parameters, calculate the actual cost to the user #' @details Uses the formula above to caluclate total costs given a certain level of spending. This is the premiums plus either the out-of-pocket maximum, the actual spend level if the deductible has not been met, or the amount of the deductible plus the coinsurance for spend above the deductible but below the out-of-pocket maximum. #' @author Jared P. Lander #' @param spend A given amount of healthcare spending as a vector for multiple amounts #' @param premiums The annual premiums for a given plan #' @param deductible The deductible for a given plan #' @param coinsurance The coinsurance percentage for spend beyond the deductible but below the out-of-pocket maximum #' @param oop_maximum The maximum amount of money (not including premiums) that the insured will pay under a given plan #' @return The total cost to the insured #' @examples #' cost(3000, 4160, 1000, .20, 6000) #' cost(3000, 6240, 800, .10, 5000) #' cost <- function(spend, premiums, deductible, coinsurance, oop_maximum) { # spend is vectorized so we use pmin to get the min between oop_maximum and (deductible + coinsurance*(spend - deductible)) for each value of spend provided pmin( # we can never pay more than oop_maximum so that is one side oop_maximum, # if we are under oop_maximum for a given amount of spend, # this is the cost pmin(spend, deductible) + coinsurance*pmax(spend - deductible, 0) ) + # we add the premiums since that factors into our cost premiums } With this function we can see if one plan is always, or mostly, cheaper than the other plan and that’s the one we would choose. # R Packages For the rest of the code we need these R packages. library(dplyr) library(ggplot2) library(tidyr) library(formattable) library(readr) library(readxl) # Spending To see our out-of-pocket cost at varying levels of healthcare spend we build a grid in$1,000 increments from $1,000 to$70,000.

spending <- tibble::tibble(Spend=seq(1000, 70000, by=1000))

We call our cost function on each amount of spend for the Value and Advantage plans.

spending <- spending %>%
# use our function to calcuate the cost for the value plan
mutate(Value=cost(
spend=Spend,
premiums=parameters$Premiums[1], deductible=parameters$Deductible[1],
coinsurance=parameters$Coinsurance[1], oop_maximum=parameters$OOP_Maximum[1]
)
) %>%
# use our function to calcuate the cost for the Advantage plan
spend=Spend,
premiums=parameters$Premiums[2], deductible=parameters$Deductible[2],
coinsurance=parameters$Coinsurance[2], oop_maximum=parameters$OOP_Maximum[2]
)
) %>%
# compute the difference in costs for each plan
# the winner for a given amount of spend is the cheaper plan
mutate(Winner=if_else(Advantage < Value, 'Advantage', 'Value'))

The results are in the following table, showing every other row to save space. The Spend column is a theoretical amount of spending with a red bar giving a visual sense for the increasing amounts. The Value and Advantage columns are the corresponding overall costs of the plans for the given amount of Spend. The Difference column is the result of AdvantageValue where positive numbers in blue mean that the Value plan is cheaper while negative numbers in red mean that the Advantage plan is cheaper. This is further indicated in the Winner column which has the corresponding colors.

$2,000$5,360 $7,160 1800 Value$4,000 $5,760$7,360 1600 Value
$6,000$6,160 $7,560 1400 Value$8,000 $6,560$7,760 1200 Value
$10,000$6,960 $7,960 1000 Value$12,000 $7,360$8,160 800 Value
$14,000$7,760 $8,360 600 Value$16,000 $8,160$8,560 400 Value
$18,000$8,560 $8,760 200 Value$20,000 $8,960$8,960 0 Value
$22,000$9,360 $9,160 -200 Advantage$24,000 $9,760$9,360 -400 Advantage
$26,000$10,160 $9,560 -600 Advantage$28,000 $10,160$9,760 -400 Advantage
$30,000$10,160 $9,960 -200 Advantage$32,000 $10,160$10,160 0 Value
$34,000$10,160 $10,360 200 Value$36,000 $10,160$10,560 400 Value
$38,000$10,160 $10,760 600 Value$40,000 $10,160$10,960 800 Value
$42,000$10,160 $11,160 1000 Value$44,000 $10,160$11,240 1080 Value
$46,000$10,160 $11,240 1080 Value$48,000 $10,160$11,240 1080 Value
$50,000$10,160 $11,240 1080 Value$52,000 $10,160$11,240 1080 Value
$54,000$10,160 $11,240 1080 Value$56,000 $10,160$11,240 1080 Value
$58,000$10,160 $11,240 1080 Value$60,000 $10,160$11,240 1080 Value
$62,000$10,160 $11,240 1080 Value$64,000 $10,160$11,240 1080 Value
$66,000$10,160 $11,240 1080 Value$68,000 $10,160$11,240 1080 Value
$70,000$10,160 $11,240 1080 Value Of course, plotting often makes it easier to see what is happening. spending %>% select(Spend, Value, Advantage) %>% # put the plot in longer format so ggplot can set the colors gather(key=Plan, value=Cost, -Spend) %>% ggplot(aes(x=Spend, y=Cost, color=Plan)) + geom_line(size=1) + scale_x_continuous(labels=scales::dollar) + scale_y_continuous(labels=scales::dollar) + scale_color_brewer(type='qual', palette='Set1') + labs(x='Healthcare Spending', y='Out-of-Pocket Costs') + theme( legend.position='top', axis.title=element_text(face='bold') ) It looks like there is only a small window where the Advantage plan is cheaper than the Value plan. This will be more obvious if we draw a plot of the difference in cost. spending %>% ggplot(aes(x=Spend, y=Difference, color=Winner, group=1)) + geom_hline(yintercept=0, linetype=2, color='grey50') + geom_line(size=1) + scale_x_continuous(labels=scales::dollar) + scale_y_continuous(labels=scales::dollar) + labs( x='Healthcare Spending', y='Difference in Out-of-Pocket Costs Between the Two Plans' ) + scale_color_brewer(type='qual', palette='Set1') + theme( legend.position='top', axis.title=element_text(face='bold') ) To calculate the exact cutoff points where one plan becomes cheaper than the other plan we have to solve for where the two curves intersect. Due to the out-of-pocket maximums the curves are non-linear so we need to consider four cases. 1. The spending exceeds the point of maximum out-of-pocket spend for both plans 2. The spending does not exceed the point of maximum out-of-pocket spend for either plan 3. The spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan 4. The spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan When the spending exceeds the point of maximum out-of-pocket spend for both plans the curves are parallel so there will be no cross over point. When the spending does not exceed the point of maximum out-of-pocket spend for either plan we set the cost calculations (not including the out-of-pocket maximum) for each plan equal to each other and solve for the amount of spend that creates the equality. To keep the equations smaller we use variables such as $$d_v$$ for the Value plan deductible, $$c_a$$ for the Advantage plan coinsurance and $$oop_v$$ for the out-of-pocket maximum for the Value plan. $d_v + c_v(S – d_v) + p_v = d_a + c_a(S – d_a) + p_a \\ c_v(S – D_v) – c_a(S-d_a) = d_a – d_v + p_a – p_v \\ c_vS – c_vd_v – c_aS + c_ad_a = d_a – d_v + p_a – p_v \\ S(c_v – c_a) = d_a – c_ad_a – d_v + c_vd_v + p_a – p_v \\ S(c_v – c_a) = d_a(1 – c_a) – d_v(1 – c_v) + p_a – p_v \\ S = \frac{d_a(1 – c_a) – d_v(1 – c_v) + p_a – p_v}{(c_v – c_a)}$ When the spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan, we set the out-of-pocket maximum plus premiums for the Value plan equal to the cost calculation of the Advantage plan. $oop_v + p_v = d_a + c_a(S – d_a) + p_a \\ d_a + c_a(S – d_a) + p_a = oop_v + p_v \\ c_aS – c_ad_a = oop_v + p_v – p_a – d_a \\ c_aS = oop_v + p_v – p_a + c_ad_a – d_a \\ S = \frac{oop_v + p_v – p_a + c_ad_a – d_a}{c_a}$ When the spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan, the solution is just the opposite of the previous equation. $oop_a + p_a = d_v + c_v(S – d_v) + p_v \\ d_v + c_v(S – d_v) + p_v = oop_a + p_a \\ c_vS – c_vd_v = oop_a + p_a – p_v – d_v \\ c_vS = oop_a + p_a – p_v + c_vd_v – d_v \\ S = \frac{oop_a + p_a – p_v + c_vd_v – d_v}{c_v}$ As an R function it looks like this. #' @title calculate_crossover_points #' @description Given healthcare parameters for two plans, calculate when one plan becomes more expensive than the other. #' @details Calculates the potential crossover points for different scenarios and returns the ones that are true crossovers. #' @author Jared P. Lander #' @param premiums_1 The annual premiums for plan 1 #' @param deductible_1 The deductible plan 1 #' @param coinsurance_1 The coinsurance percentage for spend beyond the deductible for plan 1 #' @param oop_maximum_1 The maximum amount of money (not including premiums) that the insured will pay under plan 1 #' @param premiums_2 The annual premiums for plan 2 #' @param deductible_2 The deductible plan 2 #' @param coinsurance_2 The coinsurance percentage for spend beyond the deductible for plan 2 #' @param oop_maximum_2 The maximum amount of money (not including premiums) that the insured will pay under plan 2 #' @return The amount of spend at which point one plan becomes more expensive than the other #' @examples #' calculate_crossover_points( #' 160, 1000, 0.2, 6000, #' 240, 800, 0.1, 5000 #' ) #' calculate_crossover_points <- function( premiums_1, deductible_1, coinsurance_1, oop_maximum_1, premiums_2, deductible_2, coinsurance_2, oop_maximum_2 ) { # calculate the crossover before either has maxed out neither_maxed_out <- (premiums_2 - premiums_1 + deductible_2*(1 - coinsurance_2) - deductible_1*(1 - coinsurance_1)) / (coinsurance_1 - coinsurance_2) # calculate the crossover when one plan has maxed out but the other has not one_maxed_out <- (oop_maximum_1 + premiums_1 - premiums_2 + coinsurance_2*deductible_2 - deductible_2) / coinsurance_2 # calculate the crossover for the reverse other_maxed_out <- (oop_maximum_2 + premiums_2 - premiums_1 + coinsurance_1*deductible_1 - deductible_1) / coinsurance_1 # these are all possible points where the curves cross all_roots <- c(neither_maxed_out, one_maxed_out, other_maxed_out) # now calculate the difference between the two plans to ensure that these are true crossover points all_differences <- cost(all_roots, premiums_1, deductible_1, coinsurance_1, oop_maximum_1) - cost(all_roots, premiums_2, deductible_2, coinsurance_2, oop_maximum_2) # only when the difference between plans is 0 are the curves truly crossing all_roots[all_differences == 0] } We then call the function with the parameters for both plans we are considering. crossovers <- calculate_crossover_points( parameters$Premiums[1], parameters$Deductible[1], parameters$Coinsurance[1], parameters$OOP_Maximum[1], parameters$Premiums[2], parameters$Deductible[2], parameters$Coinsurance[2], parameters$OOP_Maximum[2] ) crossovers ## [1] 20000 32000 We see that the Advantage plan is only cheaper than the Value plan when spending between$20,000 and $32,000. The next question is will our healthcare spending fall in that narrow band between$20,000 and $32,000 where the Advantage plan is the cheaper option? # Probability of Spending This part gets tricky. I’d like to figure out the probability of spending between$20,000 and $32,000. Unfortunately, it is not easy to find healthcare spending data due to the opaque healthcare system. So I am going to make a number of assumptions. This will likely violate a few principles, but it is better than nothing. Assumptions and calculations: • Healthcare spending follows a log-normal distribution • We will work with New York State data which is possibly different than New York City data • We know the mean for New York spending in 2014 • We will use the accompanying annual growth rate to estimate mean spending in 2019 • We have the national standard deviation for spending in 2009 • In order to figure out the standard deviation for New York, we calculate how different the New York mean is from the national mean as a multiple, then multiply the national standard deviation by that number to approximate the New York standard deviation in 2009 • We use the growth rate from before to estimate the New York standard deviation in 2019 First, we calculate the mean. The Centers for Medicare & Medicaid Services has data on total and per capita medical expenditures by state from 1991 to 2014 and includes the average annual percentage growth. Since the data are bundled in a zip with other files, I posted them on my site for easy access. spend_data_url <- 'https://jaredlander.com/data/healthcare_spending_per_capita_1991_2014.csv' health_spend <- read_csv(spend_data_url) We then take just New York spending for 2014 and multiply it by the corresponding growth rate. ny_spend <- health_spend %>% # get just New York filter(State_Name == 'New York') %>% # this row holds overall spending information filter(Item == 'Personal Health Care ($)') %>%
# we only need a few columns
select(Y2014, Growth=Average_Annual_Percent_Growth) %>%
# we have to calculate the spending for 2019 by accounting for growth
# after converting it to a percentage
mutate(Y2019=Y2014*(1 + (Growth/100))^5)

ny_spend
Y2014 Growth Y2019
9778 5 12479.48

The standard deviation is trickier. The best I can find was the standard deviation on the national level in 2009. In 2013 the Centers for Medicare & Medicaid Services wrote in Volume 3, Number 4 of Medicare & Medicaid Research Review an article titled Modeling Per Capita State Health Expenditure Variation: State-Level Characteristics Matter. Exhibit 2 shows that the standard deviation of healthcare spending was $1,241 for the entire country in 2009. We need to estimate the New York standard deviation from this and then account for growth into 2019. Next, we figure out the difference between the New York State spending mean and the national mean as a multiple. nation_spend <- health_spend %>% filter(Item == 'Personal Health Care ($)') %>%
filter(Region_Name == 'United States') %>%
pull(Y2009)

ny_multiple <- ny_spendY2014/nation_spend ny_multiple ## [1] 1.418746 We see that the New York average is 1.4187464 times the national average. So we multiply the national standard deviation from 2009 by this amount to estimate the New York State standard deviation and assume the same annual growth rate as the mean. Recall, we can multiply the standard deviation by a constant. \begin{align} \text{var}(x*c) &= c^2*\text{var}(x) \\ \text{sd}(x*c) &= c*\text{sd}(x) \end{align} ny_spend <- ny_spend %>% mutate(SD2019=1241*ny_multiple*(1 + (Growth/100))^10) ny_spend Y2014 Growth Y2019 SD2019 9778 5 12479.48 2867.937 My original assumption was that spending would follow a normal distribution, but New York’s resident agricultural economist, JD Long, suggested that the spending distribution would have a floor at zero (a person cannot spend a negative amount) and a long right tail (there will be many people with lower levels of spending and a few people with very high levels of spending), so a log-normal distribution seems more appropriate. $\text{spending} \sim \text{lognormal}(\text{log}(12479), \text{log}(2868)^2)$ Visualized it looks like this. draws <- tibble( Value=rlnorm( n=1200, meanlog=log(ny_spendY2019),
sdlog=log(ny_spend$SD2019) ) ) ggplot(draws, aes(x=Value)) + geom_density() + xlim(0, 75000) We can see that there is a very long right tail which means there are many low values and few high values. Then the probability of spending between$20,000 and $32,000 can be calculated with plnorm(). plnorm(crossovers[2], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)) - plnorm(crossovers[1], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)) ## [1] 0.02345586 So we only have a 2.35% probability of our spending falling in that band where the Advantage plan is more cost effective. Meaning we have a 97.65% probability that the Value plan will cost less over the course of a year. We can also calculate the expected cost under each plan. We do this by first calculating the probability of spending each (thousand) dollar amount (since the log-normal is a continuous distribution this is an estimated probability). We multiply each of those probabilities against their corresponding dollar amounts. Since the distribution is log-normal we need to exponentiate the resulting number. The data are on the thousands scale, so we multiply by 1000 to put it back on the dollar scale. Mathematically it looks like this. $\mathbb{E}_{\text{Value}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Value}} \right\} \\ \mathbb{E}_{\text{Advantage}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Advantage}} \right\}$ The following code calculates the expected cost for each plan. spending %>% # calculate the point-wise estimated probabilities of the healthcare spending # based on a log-normal distribution with the appropriate mean and standard deviation mutate( SpendProbability=dlnorm( Spend, meanlog=log(ny_spend$Y2019),
sdlog=log(ny_spend$SD2019) ) ) %>% # compute the expected cost for each plan # and the difference between them summarize( ValueExpectedCost=sum(Value*SpendProbability), AdvantageExpectedCost=sum(Advantage*SpendProbability), ExpectedDifference=sum(Difference*SpendProbability) ) %>% # exponentiate the numbers so they are on the original scale mutate_each(funs=exp) %>% # the spending data is in increments of 1000 # so multiply by 1000 to get them on the dollar scale mutate_each(funs=~ .x * 1000) ValueExpectedCost AdvantageExpectedCost ExpectedDifference 5422.768 7179.485 1323.952 This shows that overall the Value plan is cheaper by about$1,324 dollars on average.

# Conclusion

We see that there is a very small window of healthcare spending where the Advantage plan would be cheaper, and at most it would be about $600 cheaper than the Value plan. Further, the probability of falling in that small window of savings is just 2.35%. So unless our spending will be between$20,000 and $32,000, which it likely will not be, it is a better idea to choose the Value plan. Since the Value plan is so likely to be cheaper than the Advantage plan I wondered who would pick the Advantage plan. Economist Jon Hersh invokes behavioral economics to explain why people may select the Advantage plan. Some parts of the Advantage plan are lower than the Value plan, such as the deductible, coinsurance and out-of-pocket maximum. People see that under certain circumstances the Advantage plan would save them money and are enticed by that, not realizing how unlikely that would be. So they are hedging against a low probability situation. (A consideration I have not accounted for is family size. The number of members in a family can have a big impact on the overall spend and whether or not it falls into the narrow band where the Advantage plan is cheaper.) In the end, the Value plan is very likely going to be cheaper than the Advantage plan. # Try it at Home I created a Shiny app to allow users to plug in the numbers for their own plans. It is rudimentary, but it gives a sense for the relative costs of different plans. # Thanks A big thanks to Jon Hersh, JD Long, Kaz Sakamoto, Rebecca Martin and Adam Hogan for reviewing this post. Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone. After four sold-out years in New York City, the R Conference made its debut in Washington DC to a sold-out crowd of data scientists at the Ronald Reagan Building on November 8th & 9th. Our speakers shared presentations on a variety of R-related topics. A big thank you to our speakers Max Kuhn, Emily Robinson, Mike Powell, Mara Averick, Max Richman, Stephanie Hicks, Michael Garris, Kelly O’Briant, David Smith, Anna Kirchner, Roger Peng, Marck Vaisman, Soumya Kalra, Jonathan Hersh, Vivian Peng, Dan Chen, Catherine Zhou, Jim Klucar, Lizzy Huang, Refael Lav, Ami Gates, Abhijit Dasgupta, Angela Li and Tommy Jones. # Some highlights from the conference: ## R Superstars Mara Averick, Roger Peng and Emily Robinson A hallmark of our R conferences is that the speakers hang out with all the attendees and these three were crowd favorites. ## Michael Powell Brings R to the aRmy Major Michael Powell describes how R has brought efficiency to the Army Intelligence and Security Command by getting analysts out of Excel and into the Tidyverse. “Let me turn those 8 hours into 8 seconds for you,” says Powell. ## Max Kuhn Explains the Applications of Equivocals to Apply Levels of Certainty to Predictions After autographing his book, Applied Predictive Modeling, for a lucky attendee, Max Kuhn explains how Equivocals can be applied to individual predictions in order to avoid reporting predictions when there is significant uncertainty. ## NYR and DCR Speaker Emily Robinson Getting an NYR Hoodie for her Awesome Tweeting Emily Robinson tweeted the most at the 2018 NYR conference, winning her a WASD mechanical keyboard and at DCR she came in second so we gave her a limited edition NYR hoodie. ## Max Richman Shows How SQL and R can Co-Exist Max Richman, wearing the same shirt he wore when he spoke at the first NYR, shows parallels between dplyr and SQL. ## Michael Garris Tells the Story of the MNIST Dataset Michael Garris was a member of the team that built the original MNIST dataset, which set the standard for handwriting image classification in the early 1990s. This talk may have been the first time the origin story was ever told. ## R Stats Luminary Roger Peng Explains Relationship Between Air Pollution and Public Health Roger Peng shows us how air pollution levels has fallen over the past 50 years resulting in dramatic improvements in air quality and health (with help from R). ## Kelly O’Briant Combining R with Serverless Computing Kelly O’Briant demonstrates how to easily deploy R projects on Google Compute Engine and promoted the new #radmins hashtag. ## Hot Dog vs Not Hot Dog by David Smith (Inspired by Jian-Yang from HBO’s Silicon Valley) David Smith, one of the original R users, shows how to recreate HBO’s Silicon Valley’s Not Hot Dog app using R and Azure ## Jon Hersh Describes How to Push for Data Science Within Your Organization Jon Hersh discusses the challenges, and solutions, of getting organizations to embrace data science. ## Vivian Peng and the Importance of Data Storytelling Vivian Peng asks the question, how do we protect the integrity of our data analysis when it’s published for the world to see? ## Dan Chen Signs His Book for David Smith Dan Chen autographing a copy of his book, Pandas for Everyone, for David Smith. Now David Smith has to sign his book, An Introduction to R, for Dan. ## Malorie Hughes Analyzing Tweets On the first day I challenged the audience to analyze the tweets from the conference and Malorie Hughes, a data scientist with NPR, designed a Twitter analytics dashboard to track the attendee with the most tweets with the hashtag #rstatsdc. Seth Wenchel won a WASD keyboard for the best tweeting. And we presented Malorie wit a DCR speaker mug. ## Strong Showing from the #RLadies! The #rladies group is growing year after year and it is great seeing them in force at NYR and DCR! ## Packages Matthew Hendrickson, a DCR attendee, posted on twitter every package mentioned during the two-day conference: ## Data Community DC A special thanks to the Data Community DC for helping us make the DC R Conference an incredible experience. ## Videos The videos for the conference will be posted in the coming weeks to dc.rstats.ai. ## See You Next Year Looking forward to more great conferences at next year’s NYR and DCR! Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone. ### Highlights from the 2016 New York R Conference Originally posted on www.work-bench.com. You might be asking yourself, “How was the 2016 New York R Conference?” Well, if we had to sum it up in one picture, it would look a lot like this (thank you to Drew Conway for the slide & delivering the battle cry for data science in NYC): Our 2nd annual, sold-out New York R Conference was back this year on April 8th & 9th at Work-Bench. Co-hosted with our friends at Lander Analytics, this year’s conference was bigger and better than ever, with over 250 attendees, and speakers from Airbnb, AT&T, Columbia University, eBay, Etsy, RStudio, Socure, and Tamr. In case you missed the conference or want to relive the excitement, all of the talks and slides are now live on the R Conference website. With 30 talks, each 20 minutes long and two forty-minute keynotes, the topics of the presentations were just as diverse as the speakers. Vivian Peng gave an emotional talk on data visualization using non-visual senses and “The Feels.” Bryan Lewis measured the shadows of audience members to demonstrate the pros and cons of projection methods, and Daniel Lee talked about life, love, Stan, and March Madness. But, even with 32 presentations from a diverse selection of speakers, two dominant themes emerged: 1) Community and 2) Writing better code. Given the amazing caliber of speakers and attendees, community was on everyone’s mind from the start. Drew Conway emoted the past, present, and future of data science in NYC, and spoke to the dangers of tearing down the tent we built. Joe Rickert from Microsoft discussed the R Consortium and how to become involved. Wes McKinney talked about community efforts in improving interoperability between data science languages with the new Feather data frame file format under the Apache Arrow project. Elena Grewal discussed how Airbnb’s data science team made changes to the hiring process to increase the number of female hires, and Andrew Gelman even talked about how your political opinions are shaped by those around you in his talk about Social Penumbras. Writing better code also proved to be a dominant theme throughout the two day conference. Dan Chen of Lander Analytics talked about implementing tests in R. Similarly, Neal Richardson and Mike Malecki of Crunch.io talked about how they learned to stop munging and love tests, and Ben Lerner discussed how to optimize Python code using profilers and Cython. The perfect intersection of themes came from Bas van Schaik of Semmle who discussed how to use data science to write better code by treating code as data. While everyone had some amazing insights, these were our top five highlights: ### JJ Allaire Releases a New Preview of RStudio JJ Allaire, the second speaker of the conference, got the crowd fired up by announcing new features of RStudio and new packages. Particularly exciting was bookdown for authoring large documents, R Notebooks for interactive Markdown files and shared sessions so multiple people can code together from separate computers. ### Andrew Gelman Discusses the Political Impact of the Social Penumbra As always, Dr. Andrew Gelman wowed the crowd with his breakdown of how political opinions are shaped by those around us. He utilized his trademark visualizations and wit to convey the findings of complex models. ### Vivian Peng Helps Kick off the Second Day with a Punch to the Gut On the morning of the second day of the conference, Vivian Peng gave a heartfelt talk on using data visualization and non-visual senses to drive emotional reaction and shape public opinion on everything from the Syrian civil war to drug resistance statistics. ### Ivor Cribben Studies Brain Activity with Time Varying Networks University of Alberta Professor Ivor Cribben demonstrated his techniques for analyzing fMRI data. His use of network graphs, time series and extremograms brought an academic rigor to the conference. ### Elena Grewal Talks About Scaling Data Science at Airbnb After a jam-packed 2 full days, Elena Grewal helped wind down the conference with a thoughtful introspection on how Airbnb has grown their data science team from 5 to 70 people, with a focus on increasing diversity and eliminating bias in the hiring process. See the full conference videos & presentations below, and sign up for updates for the 2017 New York R Conference on www.rstats.nyc. To get your R fix in the meantime, follow @nyhackr, @Work_Bench, and @rstatsnyc on Twitter, and check out the New York Open Programming Statistical Meetup or one of Work-Bench’s upcoming events! Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone. Last year, as I embarked on my NFL sports statistics work, I attended the Sloan Sports Analytics Conference for the first time. A year later, after a very successful draft, I was invited to present an R workshop to the conference. My time slot was up against Nate Silver so I didn’t expect many people to attend. Much to my surprise when I entered the room every seat was taken, people were lining the walls and sitting in the aisles. My presentation, which was unrelated to the work I did, analyzed the Giants’ probability of passing versus rushing and the probability of which receiver was targeted. It is available at the talks section of my site. After the talk I spent the rest of the day fielding questions and gave away copies of R for Everyone and an NYC Data Mafia shirt. Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone. Earlier this week, my company, Lander Analytics, organized our first public Bayesian short course, taught by Andrew Gelman, Bob Carpenter and Daniel Lee. Needless to say the class sold out very quickly and left a long wait list. So we will schedule another public training (exactly when tbd) and will make the same course available for private training. This was the first time we utilized three instructors (as opposed to a main instructor and assistants which we often use for large classes) and it led to an amazing dynamic. Bob laid the theoretical foundation for Markov chain Monte Carlo (MCMC), explaining both with math and geometry, and discussed the computational considerations of performing simulation draws. Daniel led the participants through hands-on examples with Stan, covering everything from how to describe a model, to efficient computation to debugging. Andrew gave his usual, crowd dazzling performance use previous work as case studies of when and how to use Bayesian methods. It was an intensive three days of training with an incredible amount of information. Everyone walked away knowing a lot more about Bayes, MCMC and Stan and eager to try out their new skills, and an autographed copy of Andrew’s book, BDA3. A big help, as always was Daniel Chen who put in so much effort making the class run smoothly from securing the space, physically moving furniture and running all the technology. Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone. On April 24th and 25th Lander Analytics and Work-Bench coorganized the (sold-out) inaugural New York R Conference. It was an amazing weekend of nerding out over R and data, with a little Python and Julia mixed in for good measure. People from all across the R community gathered to see rockstars discuss their latest and greatest efforts. Highlights include: Bryan Lewis wowing the crowd (there were literally gasps) with rthreejs implemented with htmlwidgets. Hilary Parker receiving spontaneous applause in the middle of her talk about reproducible research at Etsy for her explainr, catsplainr and mansplainr packages. James Powell speaking flawless Mandarin in a talk tangentially about Python. Vivian Peng also receiving spontaneous applause for her discussion of storytelling with data. Wes McKinney showing love for data.frames in all languages and sporting an awesome R t-shirt. Dan Chen using Shiny to study Ebola data. Andrew Gelman blowing away everyone with his keynote about Bayesian methods with particular applications in politics. Videos of the talks are available at http://www.rstats.nyc/#speakers with slides being added frequently. A big thank you to sponsors RStudio, Revolution Analytics, DataKind, Pearson, Brewla Bars and Twillory. Next year’s conference is already being planned for April. To inquire about sponsoring or speaking please get in touch. Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone. The other night I attended a talk about the history of Brooklyn pizza at the Brooklyn Historical Society by Scott Wiener of Scott’s Pizza Tours. Toward the end, a woman stated she had a theory that pizza slice prices stay in rough lockstep with New York City subway fares. Of course, this is a well known relationship that even has its own Wikipedia entry, so Scott referred her to a New York Times article from 1995 that mentioned the phenomenon. However, he wondered if the preponderance of dollar slice shops has dropped the price of a slice below that of the subway and playfully joked that he wished there was a statistician in the audience. Naturally, that night I set off to calculate the current price of a slice in New York City using listings from MenuPages. I used R’s XML package to pull the menus for over 1,800 places tagged as “Pizza” in Manhattan, Brooklyn and Queens (there was no data for Staten Island or The Bronx) and find the price of a cheese slice. After cleaning up the data and doing my best to find prices for just cheese/plain/regular slices I found that the mean price was$2.33 with a standard deviation of $0.52 and a median price of$2.45. The base subway fare is $2.50 but is actually$2.38 after the 5% bonus for putting at least $5 on a MetroCard. So, even with the proliferation of dollar slice joints, the average slice of pizza ($2.33) lines up pretty nicely with the cost of a subway ride (\$2.38).

Taking it a step further, I broke down the price of a slice in Manhattan, Queens and Brooklyn. The vertical lines represented the price of a subway ride with and without the bonus.  We see that the price of a slice in Manhattan is perfectly right there with the subway fare.

MenuPages even broke down Queens Neighborhoods so we can have a more specific plot.