My R-using family have a very old house with uneven heating and cooling. We are going to replace the HVAC system soon and rather than just install a new system I wanted to make data-driven decisions about the replacement. Since I have a couple of ecobee Thermostats which have remote sensors, I figured there must be an API I can use to track the temperature in various rooms of the house.

A quick search turned up an API that was even better than I hoped. My initial plan was to call the API every five minutes and write the data to a database, either self-hosted or managed by DigitalOcean. This would require my code to never miss a run, because each call would only capture the status at that moment, and there would be no way to go back in time to correct things. Fortunately, the API can also return historic data for any period of time in five-minute increments, which is much more useful.

In this post I will go through the process of calling the API, storing the data and building a workflow using {targets}. In the next post I’ll cover automating the process using GitHub Actions. In the third post I’ll show how I analyzed the data using time series methods.

Accessing the API

Before we can use the API we need to sign up for a key, and have an ecobee thermostat, of course. Getting the API key is explained here, though last I checked the process cannot be completed with two-factor authentication (2FA) turned on, so we have to disable 2FA, register for the API, then re-enable 2FA.

Using the API requires the API key and an access token, but the access token only lasts for one hour. So we are given a refresh token, which is good for a year, and can be used to get a new access token. There is probably a better way to build the URL for {httr}, but this worked. Both the refresh token and API key will be strings of letters, numbers and symbols along the lines of ha786234h1q763h.

token_request <- httr::POST(
    # it uses the token endpoint with parameters in the URL
    url="https://api.ecobee.com/token?grant_type=refresh_token&code=ha786234h1q763h&client_id=kjdf837hw7384",
    encode='json'
)

access_token <- httr::content(token_request)$access_token

Next, we use the API to get information about our thermostats, particularly the IDs and the current times. The access token is used for bearer authentication, which is added as a header to the httr::GET() call. All API requests from now on will need this access token.

thermostat_info <- httr::GET(
    # the request is to the thermostat endpoint with parameters passed in a json body that is part of the URL
    # location is needed for the timezone offset
    'https://api.ecobee.com/1/thermostat?format=json&body={"selection":{"selectionType":"registered","selectionMatch":"","includeLocation":true}}'
    # supplying a header with "Bearer access_token" tells the API who we are
    , httr::add_headers(Authorization=glue::glue('Bearer {access_token}'))
    # json is the chosen format
    , encode='json'
) %>%
    # extract the content into a list
    httr::content()

From this object we can get the thermostat IDs, which will be useful for generating the report.

thermostat_ids <- thermostat_info$thermostatList %>%
    purrr::map_chr('identifier') %>%
    # make it a single-element character vector that separates IDs with a comma
    paste(collapse=',')

thermostat_ids
## [1] "28716,17611"
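That purrr::map_chr() call uses a lesser-known shortcut: passing a string instead of a function extracts the element with that name from each list item. A minimal sketch with a hand-built toy list (using the same IDs as above) shows the idea:

```r
library(purrr)

# a toy stand-in for thermostat_info$thermostatList
thermostat_list <- list(
    list(identifier='28716', name='Upstairs'),
    list(identifier='17611', name='Downstairs')
)

# the string 'identifier' acts as an extractor function
paste(map_chr(thermostat_list, 'identifier'), collapse=',')
## [1] "28716,17611"
```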

Given thermostat IDs we can request a report for a given time period. The report requires a startDate and endDate. By default the returned data run from midnight to 11:55 PM UTC, so a startInterval and endInterval shift the times to the appropriate time zone, which can be ascertained from thermostat_info$thermostatList[[1]]$location$timeZoneOffsetMinutes. The intervals come in groups of five minutes, so a 24-hour day has 288 of them, numbered 0 through 287. We then account for being east or west of UTC and subtract a fifth of the offset (the offset is in minutes and each interval covers five of them). The endInterval is just one less than the startInterval in order to completely wrap around the clock. Doing these calculations should get us the entire day in our time zone.

timeOffset <- thermostat_info$thermostatList[[1]]$location$timeZoneOffsetMinutes
start_interval <- 287*(timeOffset > 0) - timeOffset/5
end_interval <- start_interval - 1
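As a sanity check, consider a thermostat in New York, which in winter is 300 minutes behind UTC (a timeZoneOffsetMinutes of -300; an illustrative value, not pulled from the API). The formula lands on intervals 60 and 59:

```r
# New York in winter: 300 minutes behind UTC
timeOffset <- -300
start_interval <- 287*(timeOffset > 0) - timeOffset/5
end_interval <- start_interval - 1
c(start_interval, end_interval)
## [1] 60 59
```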

Now we build the request URL for the report. The body in the URL takes a number of parameters. We already have startDate, endDate, startInterval and endInterval. columns is a comma separated listing of the desired data points. For our purposes we want "zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy". The selection parameter takes two arguments: selectionType, which should be "thermostats" and selectionMatch, which is the comma-separated listing of thermostat IDs saved in thermostat_ids. We want data from the room sensors so we set includeSensors to true. Like in previous requests, this URL is built using glue::glue().
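One wrinkle worth noting: the JSON body is full of literal curly braces, which glue::glue() would normally treat as interpolation, so they are doubled to escape them. A tiny sketch of the escaping:

```r
library(glue)

startDate <- '2021-01-25'
# doubled braces are emitted literally; single braces interpolate
glue('{{"startDate":"{startDate}"}}')
## {"startDate":"2021-01-25"}
```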

report <- httr::GET(
    glue::glue(
        # the request is to the runtimeReport endpoint with parameters passed in a json body that is part of the URL
        'https://api.ecobee.com/1/runtimeReport?format=json&body={{"startDate":"{startDate}","endDate":"{endDate}","startInterval":{startInterval},"endInterval":{endInterval},"columns":"zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy","selection":{{"selectionType":"thermostats","selectionMatch":"{thermostats}"}},"includeSensors":true}}'
    )
    # authentication
    , httr::add_headers(Authorization=glue::glue('Bearer {access_token}'))
    , encode='json'
) %>%
    httr::content()

Handling the Data

Now that we have the report object we need to turn it into a nice data.frame or tibble. There are two major components to the data: thermostat information and sensor information. Multiple sensors are associated with a thermostat, including a sensor in the thermostat itself. So if our house has two HVAC zones, and hence two thermostats, we would have two sets of this information. The thermostat information includes overall data such as date, time, average temperature (average reading for all the sensors associated with a thermostat), HVACMode (heat or cool), fan speed and outdoor temperature. The sensor information has readings from each sensor associated with each thermostat such as detected occupancy and temperature.

The thermostat level and sensor level data are kept in lists inside the report object and need to be handled separately.

Thermostat Level Data

The report object has an element called reportList, which is a list where each element represents a different thermostat. For a house with two thermostats this list will have a length of two. Each of these elements contains a list called rowList. This has as many elements as the number of intervals requested, 288 for a full day (five-minute intervals for 24 hours). All of these elements are character vectors of length one, with commas separating the values. A few examples are below.

report$reportList[[1]]$rowList
## [[1]]
## [1] "2021-01-25,00:00:00,70.5,heat,75,28.4,43,5,0,Sleep,71,71,heatOff,0"
##
## [[2]]
## [1] "2021-01-25,00:05:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
##
## [[3]]
## [1] "2021-01-25,00:10:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
##
## [[4]]
## [1] "2021-01-25,00:15:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
##
## [[5]]
## [1] "2021-01-25,00:20:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"

We can combine all of these into a single vector with unlist().

unlist(report$reportList[[1]]$rowList)
## [1] "2021-01-25,00:00:00,70.5,heat,75,28.4,43,5,0,Sleep,71,71,heatOff,0"
## [2] "2021-01-25,00:05:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [3] "2021-01-25,00:10:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [4] "2021-01-25,00:15:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [5] "2021-01-25,00:20:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"

The column names are stored in the columns element of the report object.

report$columns
## [1] "zoneAveTemp,HVACmode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHVACmode,zoneOccupancy"

This can be split into a vector of names using strsplit().

strsplit(report$columns, split=',')[[1]]
##  [1] "zoneAveTemp"     "HVACmode"        "fan"             "outdoorTemp"
##  [5] "outdoorHumidity" "sky"             "wind"            "zoneClimate"
##  [9] "zoneCoolTemp"    "zoneHeatTemp"    "zoneHVACmode"    "zoneOccupancy"

Perhaps a lesser-known feature of readr::read_csv() is that, rather than a file, it can read a character vector where each element has comma-separated values and return a tibble.

library(magrittr)
library(readr)

report$reportList[[1]]$rowList %>%
    unlist() %>%
    # add date and time to the specified column names
    read_csv(col_names=c('date', 'time', strsplit(report$columns, split=',')[[1]]))

date        time      zoneAveTemp  HVACmode  fan  outdoorTemp  outdoorHumidity  sky  wind  zoneClimate  zoneCoolTemp  zoneHeatTemp  zoneHVACmode  zoneOccupancy
2021-01-25  00:00:00  70.5         heat      75   28.4         43               5    0     Sleep        71            71            heatOff       0
2021-01-25  00:05:00  70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
2021-01-25  00:10:00  70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
2021-01-25  00:15:00  70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
2021-01-25  00:20:00  70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0

We repeat that for each element in report$reportList and we have a nice tibble with the data we need.

library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:magrittr':
##
##     set_names
# using the thermostat IDs as names lets them be identified in the tibble
names(report$reportList) <- purrr::map_chr(report$reportList, 'thermostatIdentifier')
central_thermostat_info <- report$reportList %>%
    map_df(
        ~ read_csv(unlist(.x$rowList), col_names=c('date', 'time', strsplit(report$columns, split=',')[[1]])),
        .id='Thermostat'
    )
central_thermostat_info

Thermostat  date        time      zoneAveTemp  HVACmode  fan  outdoorTemp  outdoorHumidity  sky  wind  zoneClimate  zoneCoolTemp  zoneHeatTemp  zoneHVACmode  zoneOccupancy
28716       2021-01-25  00:00:00  70.5         heat      75   28.4         43               5    0     Sleep        71            71            heatOff       0
28716       2021-01-25  00:05:00  70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
28716       2021-01-25  00:10:00  70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
28716       2021-01-25  00:15:00  70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
28716       2021-01-25  00:20:00  70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
17611       2021-01-25  00:00:00  64.4         heat      135  28.4         43               5    0     Sleep        78            64            heatOff       0
17611       2021-01-25  00:05:00  64.2         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
17611       2021-01-25  00:10:00  64.0         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
17611       2021-01-25  00:15:00  63.9         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
17611       2021-01-25  00:20:00  63.7         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0

Sensor Level Data

Due to the way the sensor data are stored, they are more difficult to extract than the thermostat data. First, we get the column names for the sensors. The problem is that these vary depending on how many sensors are associated with each thermostat and also on the type of sensor. As of now there are three types: the thermostat itself, which measures occupancy, temperature and humidity, and two kinds of remote sensors, both of which measure occupancy and temperature. To handle this we write two functions: one relates a sensor ID to a sensor name, and the other joins the result of the first function to the column names listed in the columns element of each element in the sensorList part of the report.

relate_sensor_id_to_name <- function(sensorInfo)
{
    purrr::map_df(
        sensorInfo,
        ~tibble::tibble(ID=.x$sensorId, name=glue::glue('{.x$sensorName}_{.x$sensorType}'))
)
}

# see how it works
relate_sensor_id_to_name(report$sensorList[[1]]$sensors)
ID name
rs:100:2 Bedroom 1_occupancy
rs:101:1 Master Bedroom_temperature
rs:101:2 Master Bedroom_occupancy
rs2:100:1 Office 1_temperature
rs2:100:2 Office 1_occupancy
rs2:101:1 Office 2_temperature
rs:100:1 Bedroom 1_temperature
rs2:101:2 Office 2_occupancy
ei:0:1 Thermostat Temperature_temperature
ei:0:2 Thermostat Humidity_humidity
ei:0:3 Thermostat Motion_occupancy
make_sensor_column_names <- function(sensorInfo)
{
    sensorInfo$columns %>%
        unlist() %>%
        tibble::enframe(name='index', value='id') %>%
        dplyr::left_join(relate_sensor_id_to_name(sensorInfo$sensors), by=c('id'='ID')) %>%
        dplyr::mutate(name=as.character(name)) %>%
        # columns without a matching sensor name keep their original id
        dplyr::mutate(name=dplyr::if_else(is.na(name), id, name))
}
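The join-with-fallback at the end of that function is what keeps the date and time columns, which have no sensor ID, from becoming NA. A sketch with made-up sensor IDs:

```r
library(dplyr)
library(tibble)

# a toy version of what relate_sensor_id_to_name() returns (hypothetical IDs)
sensor_names <- tibble(
    ID=c('rs:0:1', 'rs:0:2'),
    name=c('Guest Room_temperature', 'Guest Room_occupancy')
)

c('date', 'time', 'rs:0:1', 'rs:0:2') %>%
    enframe(name='index', value='id') %>%
    left_join(sensor_names, by=c('id'='ID')) %>%
    # date and time have no match, so they keep their original ids
    mutate(name=if_else(is.na(name), id, name))
```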

# see how it works
make_sensor_column_names(report$sensorList[[1]])

index  id         name
1      date       date
2      time       time
3      rs:100:2   Bedroom 1_occupancy
4      rs:101:1   Master Bedroom_temperature
5      rs:101:2   Master Bedroom_occupancy
6      rs2:100:1  Office 1_temperature
7      rs2:100:2  Office 1_occupancy
8      rs2:101:1  Office 2_temperature
9      rs:100:1   Bedroom 1_temperature
10     rs2:101:2  Office 2_occupancy
11     ei:0:1     Thermostat Temperature_temperature
12     ei:0:2     Thermostat Humidity_humidity
13     ei:0:3     Thermostat Motion_occupancy

Then for a set of sensors we can read the data from the data element using the read_csv() trick we saw earlier. Some manipulation is needed: we pivot the data longer, keep certain rows, break apart a column using tidyr::separate(), make some changes with dplyr::mutate(), then pivot wider. This results in a tibble where each row represents the occupancy and temperature readings for a particular sensor at a given five-minute increment.

extract_one_sensor_info <- function(sensor)
{
    sensor_col_names <- make_sensor_column_names(sensor)$name
    sensor$data %>%
        unlist() %>%
        readr::read_csv(col_names=sensor_col_names) %>%
        # make it longer so we can easily remove rows based on a condition
        tidyr::pivot_longer(cols=c(-date, -time), names_to='Sensor', values_to='Reading') %>%
        # we use slice because grep() returns a vector of indices, not TRUE/FALSE
        dplyr::slice(grep(pattern='_(temperature)|(occupancy)$', x=Sensor, ignore.case=FALSE)) %>%
        # split apart the sensor name from what it's measuring
        tidyr::separate(col=Sensor, into=c('Sensor', 'Measure'), sep='_', remove=TRUE) %>%
        # rename the actual thermostats to say Thermostat
        dplyr::mutate(Sensor=sub(pattern='Thermostat .+$', replacement='Thermostat', x=Sensor)) %>%
        # back into wide format so each sensor is its own column
        tidyr::pivot_wider(names_from=Measure, values_from=Reading)
}

# see how it works
extract_one_sensor_info(report$sensorList[[1]])
date time Sensor occupancy temperature
2021-01-25 00:00:00 Bedroom 1 0 69.5
2021-01-25 00:00:00 Master Bedroom 1 71.6
2021-01-25 00:00:00 Office 1 1 72.7
2021-01-25 00:00:00 Office 2 0 74.9
2021-01-25 00:00:00 Thermostat 0 65.8
2021-01-25 00:05:00 Bedroom 1 0 69.3
2021-01-25 00:05:00 Master Bedroom 1 71.3
2021-01-25 00:05:00 Office 1 0 72.5
2021-01-25 00:05:00 Office 2 0 74.8
2021-01-25 00:05:00 Thermostat 0 65.7
2021-01-25 00:10:00 Bedroom 1 0 69.2
2021-01-25 00:10:00 Master Bedroom 0 71.2
2021-01-25 00:10:00 Office 1 0 72.5
2021-01-25 00:10:00 Office 2 0 74.8
2021-01-25 00:10:00 Thermostat 0 65.6
2021-01-25 00:15:00 Bedroom 1 0 69.2
2021-01-25 00:15:00 Master Bedroom 1 71.2
2021-01-25 00:15:00 Office 1 0 72.5
2021-01-25 00:15:00 Office 2 0 74.8
2021-01-25 00:15:00 Thermostat 0 65.5
2021-01-25 00:20:00 Bedroom 1 0 69.4
2021-01-25 00:20:00 Master Bedroom 1 71.2
2021-01-25 00:20:00 Office 1 0 72.7
2021-01-25 00:20:00 Office 2 0 74.9
2021-01-25 00:20:00 Thermostat 0 65.4
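The reshaping at the heart of extract_one_sensor_info() (longer, split the name, then wider) can be seen on a small made-up tibble:

```r
library(dplyr)
library(tidyr)

# made-up readings for one sensor
readings <- tibble(
    time=c('00:00:00', '00:05:00'),
    `Office 1_temperature`=c(72.7, 72.5),
    `Office 1_occupancy`=c(1, 0)
)

readings %>%
    pivot_longer(cols=-time, names_to='Sensor', values_to='Reading') %>%
    # split the sensor name from what it measures
    separate(col=Sensor, into=c('Sensor', 'Measure'), sep='_') %>%
    pivot_wider(names_from=Measure, values_from=Reading)
```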

Then we put it all together and iterate over the sets of sensors attached to each thermostat.

extract_sensor_info <- function(report)
{
    # use the thermostat IDs to name each set of sensor data
    names(report$sensorList) <- purrr::map_chr(report$reportList, 'thermostatIdentifier')
    purrr::map_df(report$sensorList, extract_one_sensor_info, .id='Thermostat')
}

sensor_info <- extract_sensor_info(report)

Finally, we join the sensor data to the thermostat data, bringing in the thermostat names along the way.

library(dplyr)

thermostat_names <- tibble::tibble(
    ID=map_chr(thermostat_info$thermostatList, 'identifier')
    , Name=map_chr(thermostat_info$thermostatList, 'name')
)

all_info <- inner_join(x=central_thermostat_info, y=sensor_info, by=c('Thermostat', 'date', 'time')) %>%
    left_join(thermostat_names, by=c('Thermostat'='ID')) %>%
    mutate(Sensor=if_else(Sensor=='Thermostat', glue::glue('{Name} Thermostat'), Sensor)) %>%
    relocate(Name, Sensor, date, time, temperature, occupancy)

# see it
all_info

Name        Sensor                 date        time      temperature  occupancy  Thermostat  zoneAveTemp  HVACmode  fan  outdoorTemp  outdoorHumidity  sky  wind  zoneClimate  zoneCoolTemp  zoneHeatTemp  zoneHVACmode  zoneOccupancy
Upstairs    Bedroom 1              2021-01-25  00:00:00  69.5         0          28716       70.5         heat      75   28.4         43               5    0     Sleep        71            71            heatOff       0
Upstairs    Master Bedroom         2021-01-25  00:00:00  71.6         1          28716       70.5         heat      75   28.4         43               5    0     Sleep        71            71            heatOff       0
Upstairs    Office 1               2021-01-25  00:00:00  72.7         1          28716       70.5         heat      75   28.4         43               5    0     Sleep        71            71            heatOff       0
Upstairs    Office 2               2021-01-25  00:00:00  74.9         0          28716       70.5         heat      75   28.4         43               5    0     Sleep        71            71            heatOff       0
Upstairs    Upstairs Thermostat    2021-01-25  00:00:00  65.8         0          28716       70.5         heat      75   28.4         43               5    0     Sleep        71            71            heatOff       0
Upstairs    Bedroom 1              2021-01-25  00:05:00  69.3         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Master Bedroom         2021-01-25  00:05:00  71.3         1          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 1               2021-01-25  00:05:00  72.5         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 2               2021-01-25  00:05:00  74.8         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Upstairs Thermostat    2021-01-25  00:05:00  65.7         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Bedroom 1              2021-01-25  00:10:00  69.2         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Master Bedroom         2021-01-25  00:10:00  71.2         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 1               2021-01-25  00:10:00  72.5         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 2               2021-01-25  00:10:00  74.8         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Upstairs Thermostat    2021-01-25  00:10:00  65.6         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Bedroom 1              2021-01-25  00:15:00  69.2         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Master Bedroom         2021-01-25  00:15:00  71.2         1          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 1               2021-01-25  00:15:00  72.5         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 2               2021-01-25  00:15:00  74.8         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Upstairs Thermostat    2021-01-25  00:15:00  65.5         0          28716       70.2         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Bedroom 1              2021-01-25  00:20:00  69.4         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Master Bedroom         2021-01-25  00:20:00  71.2         1          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 1               2021-01-25  00:20:00  72.7         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Office 2               2021-01-25  00:20:00  74.9         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Upstairs    Upstairs Thermostat    2021-01-25  00:20:00  65.4         0          28716       70.3         heat      300  28.4         43               5    0     Sleep        71            71            heatStage1On  0
Downstairs  Living Room            2021-01-25  00:00:00  64.4         0          17611       64.4         heat      135  28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Playroom               2021-01-25  00:00:00  63.7         0          17611       64.4         heat      135  28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Kitchen                2021-01-25  00:00:00  66.3         0          17611       64.4         heat      135  28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Downstairs Thermostat  2021-01-25  00:00:00  65.1         0          17611       64.4         heat      135  28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Living Room            2021-01-25  00:05:00  64.2         0          17611       64.2         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Playroom               2021-01-25  00:05:00  63.4         0          17611       64.2         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Kitchen                2021-01-25  00:05:00  65.8         0          17611       64.2         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Downstairs Thermostat  2021-01-25  00:05:00  64.8         0          17611       64.2         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Living Room            2021-01-25  00:10:00  63.9         0          17611       64.0         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Playroom               2021-01-25  00:10:00  63.2         0          17611       64.0         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Kitchen                2021-01-25  00:10:00  65.7         0          17611       64.0         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Downstairs Thermostat  2021-01-25  00:10:00  64.4         0          17611       64.0         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Living Room            2021-01-25  00:15:00  63.7         0          17611       63.9         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Playroom               2021-01-25  00:15:00  62.9         0          17611       63.9         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Kitchen                2021-01-25  00:15:00  65.6         0          17611       63.9         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Downstairs Thermostat  2021-01-25  00:15:00  64.1         0          17611       63.9         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Living Room            2021-01-25  00:20:00  63.4         0          17611       63.7         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Playroom               2021-01-25  00:20:00  62.6         0          17611       63.7         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Kitchen                2021-01-25  00:20:00  65.4         0          17611       63.7         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0
Downstairs  Downstairs Thermostat  2021-01-25  00:20:00  63.9         0          17611       63.7         heat      0    28.4         43               5    0     Sleep        78            64            heatOff       0

We now have a nice tibble in long format where each row shows a number of measurements and settings for a sensor for a given period of time.

Saving Data in the Cloud

Now that we have the data we need to save it somewhere. Since these data should be stored somewhere resilient, cloud storage seems like the best option. There are many services, including Azure, AWS, BackBlaze and GCP, but I prefer DigitalOcean Spaces since it has an easy interface and can take advantage of standard S3 APIs. DigitalOcean uses slightly different terminology than AWS, in that buckets are called spaces. Otherwise we still need to deal with Access Key IDs, Secret Access Keys, Regions and Endpoints.
A good tutorial on creating DigitalOcean spaces comes directly from them. In fact, they have so many great tutorials about Linux in general that I highly recommend using them to learn about computing. While there is a DigitalOcean package, aptly named {analogsea}, I did not have luck getting it to work, so instead I used {aws.s3}.

After writing the all_info tibble to a CSV, we put it in the DigitalOcean bucket using aws.s3::put_object(). This requires three arguments:

• file: The name of the file on our computer.
• object: The path, including the filename, inside the S3 bucket where the file will be saved.
• bucket: The name of the S3 bucket, or in our case the space (DigitalOcean calls buckets spaces).

Implicitly, put_object() and most of the functions in {aws.s3} depend on certain environment variables being set. The first two are AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which correspond to the DigitalOcean Access Key and Secret Key, respectively. These can be generated at https://cloud.digitalocean.com/account/api/tokens. The Access Key can be retrieved at any time, but the Secret Key is displayed only once, so we must save it somewhere. Just like AWS, DigitalOcean has regions; this value is saved in the AWS_DEFAULT_REGION environment variable, and in this case the region is nyc3. Then we need the AWS_S3_ENDPOINT environment variable. The DigitalOcean documentation makes it seem like this should be nyc3.digitaloceanspaces.com (or the appropriate region) but for our purposes it should just be digitaloceanspaces.com.

Perhaps the most common way to set environment variables in R is to save them in VARIABLE_NAME=value format in the .Renviron file, either in our home directory or the project directory. But we must be sure NOT to track this file with git, lest we publicly expose our secret information.

Then we can call the put_object() function.

# we make a file each day so it is important to include the date in the name
filename <- 'all_info_2021-02-15.csv'
aws.s3::put_object(
    file=filename,
    object=sprintf('%s/%s', 'do_folder_name', filename),
    bucket='do_bucket_name'
)

After this, the file lives in a DigitalOcean space to be accessed later.

Putting it All Together with {targets}

There are a lot of moving parts to this whole process, even more so than displayed here. They could have all been put into a script, but that tends to be fragile, so instead we use {targets}. This package, which is the eventual replacement for {drake}, builds a robust pipeline that schedules jobs to be run based on which jobs depend on others. It is intelligent in that it only runs jobs that are out of date (meaning the code or data have changed) and can run jobs in parallel. The author, Will Landau, gave an excellent talk about this at the New York Open Statistical Programming Meetup.

In order to use {targets}, we need a file named _targets.R in the root of our project. In there we have a list of targets, or jobs, each defined in a tar_target() function (or tar_force(), tar_change() or a similar function). The first argument to tar_target() is the name of the target and the next argument is an expression, usually a function call, whose value is saved under that name. A small example would be a target to get the access token, a target for getting thermostat information (which requires the access token), another to extract the thermostat IDs from that information and a last target to get the report based on the IDs and token.
library(targets)

list(
    tar_target(
        access_token,
        httr::POST(
            url=glue::glue("https://api.ecobee.com/token?grant_type=refresh_token&code={Sys.getenv('ECOBEE_REFRESH_TOKEN')}&client_id={Sys.getenv('ECOBEE_API_KEY')}"),
            encode='json'
        ) %>%
            httr::content() %>%
            purrr::pluck('access_token')
    )
    , tar_target(
        thermostat_info,
        httr::GET(
            'https://api.ecobee.com/1/thermostat?format=json&body={"selection":{"selectionType":"registered","selectionMatch":"","includeLocation":true}}'
            , httr::add_headers(Authorization=sprintf('Bearer %s', access_token))
            , encode='json'
        ) %>%
            httr::content()
    )
    , tar_target(
        thermostat_ids,
        thermostat_info$thermostatList %>%
purrr::map_chr('identifier') %>%
paste(collapse=',')
)
, tar_target(
report,
httr::GET(
sprintf(
'https://api.ecobee.com/1/runtimeReport?format=json&body={"startDate":"2021-01-26","endDate":"2021-01-26","startInterval":60,"endInterval":59,"columns":"zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy","selection":{"selectionType":"thermostats","selectionMatch":"%s"},"includeSensors":true}',
thermostat_ids
)
, encode='json'
) %>%
httr::content()
)
)

With the _targets.R file in the root of the directory, we can visualize the steps with tar_visnetwork().

tar_visnetwork()

To execute the jobs we use tar_make().

tar_make()

The results of individual jobs, known as targets, can be loaded into the working session with tar_load(target_name) or assigned to a variable with variable_name <- tar_read(target_name).

Each time this is run the appropriate file in the DigitalOcean space either gets written for the first time, or overwritten.

The actual set of targets for this project was much more involved and can be found on GitHub. Each target is the result of a custom function, which allows for more thorough documentation and testing. The version on GitHub even allows us to run the workflow for different dates so we can backfill data if needed.

Please note that anywhere there is potentially sensitive information such as the refresh token or API key, those were saved in environment variables, then accessed via Sys.getenv(). It is very important not to check the .Renviron files, where environment variables are stored, into git. They should be treated like passwords.
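Since Sys.getenv() quietly returns an empty string for unset variables, a small guard (a sketch; ECOBEE_API_KEY is the assumed variable name) makes a missing value obvious early:

```r
# ECOBEE_API_KEY is assumed to be set in .Renviron
api_key <- Sys.getenv('ECOBEE_API_KEY')
if (identical(api_key, '')) {
    message('ECOBEE_API_KEY is not set; add it to .Renviron and restart R')
}
```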

What’s Next?

Now that we have a functioning workflow that we can call anytime, the question becomes: how do we run this on a regular basis? We could set up a cron job on a server, but that requires the server to always be up and means maintaining the process ourselves. We could use scheduled lambda functions, but that is also a lot of work. Instead, we'll use GitHub Actions to run this workflow on a schedule. We'll go over that in the next blog post.

Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, an Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming Meetup and the New York and Washington DC R Conferences, and author of R for Everyone.


The inaugural Government & Public Sector R Conference took place virtually from December 2nd to December 4th. With over 240 attendees, 26 speakers, three panelists and a rum masterclass leader, the R|Gov conference was a place where data scientists could gather remotely to explore, share, and inspire ideas.

We had so many amazing speakers, whom we would like to thank: Lucy D’Agostino McGowan (Wake Forest University), Dr. Andrew Gelman (Columbia University), Dr. Graciela Chichilnisky (Global Thermostat), Dr. David Meza (NASA), Maj. Maxine Drake (US Army), Alex Gold (RStudio), Kimberly F. Sellers (Georgetown University; The U. S. Census Bureau), Dr. Tyler Morgan-Wall (Institute for Defense Analyses (IDA)), Imane El Idrissi & Dr. Anna Mantsoki (Foundation for Innovative New Diagnostics), Dr. Wendy Martinez (Bureau of Labor Statistics (BLS)), Col. Alfredo Corbett (US Air Force), Rose Martinez & Brooke Frye (New York City Council Data Team), Yvan Gauthier (Department of National Defence), Michael Jadoo (BLS), Tommy Jones (In-Q-Tel), Selina Carter (IDB), Refael Lav (Deloitte’s Federal Government Services teams), Dr. Abhijit Dasgupta (Zansors), Dr. Simina Boca (Georgetown University Medical Center), Dr. Wil Doane (IDA), Mo Johnson-León (Insight Lane), Dan Chen (Virginia Tech), Dr. Gwynn Sturdevant (HBS & R-Ladies DC), Marck Vaisman (Microsoft), Jonathan Hersh (Argyros School of Business), Kaz Sakamoto (Lander Analytics & Columbia University), Emily Martinez (NYC Department of Health and Mental Hygiene), Dan Whitenack (SIL International & Practical AI Podcast), Danya Murali (Arcadia Power), Malcolm Barrett (Teladoc Health) and myself.

All the talks will be shared on rstats.ai and the Lander Analytics YouTube channel in the very near future. Stay tuned!

Check out some of the highlights from the conference:

Graciela Chichilnisky explains how financial instruments can resolve climate change

One of my former professors at Columbia University, Dr. Graciela Chichilnisky, gave a presentation on how financial instruments can resolve climate change quickly and effectively by using existing capital markets to benefit high- and, especially, low-income groups. The process Dr. Chichilnisky proposes is simple and can lead to a transformation of our capitalistic economy in the direction of human survival. Furthermore, it is realistic and profitable. Dr. Chichilnisky acted as the lead U.S. author on the Intergovernmental Panel on Climate Change, which received the 2007 Nobel Peace Prize for its work in deciding world policy with respect to climate change, and she worked extensively on the Kyoto Protocol, creating and designing the carbon market that became international law in 2005.

Another classic no-slides talk from Andrew Gelman on how his team and The Economist Magazine built a presidential election forecasting model

Another professor of mine, Andrew Gelman, told us he wanted to give a talk on how his team's election forecasting succeeded brilliantly, failed miserably, or landed somewhere in between. To build the model, they combined national polls, state polls, and political and economic fundamentals. Because the results of the election weren't known when the talk was planned, he didn't know which of the three he'd be talking about. So how did his election forecast perform? The model predicted 49 out of 50 states correctly, but that doesn't mean the forecast was perfect. For some background, see this article.

Wendy Martinez inspires and shares lessons about the rocky road she traveled to using R at a U.S. Government agency

Wendy Martinez described some of her experiences — both successes and failures — using R at several U.S. government agencies. In addition to serving as the Director of the Mathematical Statistics Research Center at the Bureau of Labor Statistics (BLS) for the last eight years, she is currently the President of the American Statistical Association (ASA), and she also served in several research positions throughout the Department of Defense. She has also written two books on MATLAB! It’s nice to see that she switched to open source.

Colonel Alfredo Corbett Spoke On Air Combat Command Enterprise Data Improvements

Deputy Director of Communications of the United States Air Force Colonel Alfredo Corbett showed us why, in his work, data can be a warfighting asset, fundamental to how Air Combat Command (ACC) operates in, and supports, all domains of warfare. In coordination with the Department of Defense and the Department of the Air Force, ACC is working to improve its data governance, data architecture, data standards, and data talent & culture, implementing major improvements to the way it manages, acquires, ingests, stores, processes, exploits, analyzes, and delivers data to its almost 100,000 operators.

We Participated in Two Virtual Happy Hours!

At lunch on the first day of the conference, we took a dive into the history and distillation process of a legendary rum made at the longest continuously running distillery in the world. Mount Gay Brand Ambassador Darrio Prescod shared his knowledge and virtually transported us to Barbados, where he tuned in from. Following the second day of the conference, members of the Mount Gay brand development team took us through a rum tasting and shook up a couple of cocktails. Attendees and speakers listened and hung out, drinking rum, matcha, soda or water during our virtual happy hour.

All proceeds from the A(R)T Auction went to the R Foundation

The A(R)T Auction was held in support of the R Foundation, featuring pieces by artists in the R Community. Artists included Nadieh Bremer, Selina Carter, Thomas Lin Pedersen, Will Chase, and DiKayo Data.

We took an R-Ladies group [virtual] selfie. We would like to note that more R-Ladies participated, but chose not to share video.

Jon Harmon, Selina Carter, Mayarí Montes de Oca & DiKayo Data win Raspberry Pis, Noise Cancelling Headphones, and Gaming Mechanical Keyboards for Most Active Tweeting
You can see the R|Gov 2020 R Shiny Scoreboard here! A custom started at DCR 2018 by our Twitter scorekeeper Malorie Hughes (@data_all_day), the scoreboard has returned every year by popular demand. Congratulations to our winners!

52 Conference Attendees Participated in Pre-Conference Workshops

We ran the following workshops prior to the conference:

Moving from DCR to R|Gov
With the shift to remote, we realized we could welcome a global audience to our annual conference, as we did for the virtual New York R Conference in August. And that gave birth to R|Gov, the Government and Public Sector R Conference. This new conference focuses on work in government, defense, NGOs and the public sector, and we had speakers not only from the DC area but also from Geneva, Switzerland; Nashville, Tennessee; Quebec, Canada; and Los Angeles, California. For next year, we are working to invite speakers from more levels of government: local, state and federal. You can read more about this choice here.

Like NYR, R|Gov featured many in-person components of the gathering, like networking sessions, speaker walk-on songs and fun facts, happy hours, lots of giveaways, the Twitter contest, and the auction.

Thank you, Lander Analytics Team!

Even though it was virtual, there was a lot of work that went into the conference, and I want to thank my amazing team at Lander Analytics along with our producer, Bill Prickett, for making it all come together.

Looking Forward to New York, R|Gov, and Dublin!

If you attended, we hope you had an incredible experience. If you did not attend this year’s conference, we hope to see you at the New York R Conference and R|Gov in 2021, and, soon, the first Dublin R Conference.

Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming Meetup and the New York and Washington DC R Conferences, and author of R for Everyone.

Posted in R.

My team at Lander Analytics has been putting together conferences for six years, and they’ve always had the same fun format, which the community has really enjoyed. There’s the NYR conference for New Yorkers and those who want to fly, drive or train to join the New York community, and there’s DCR, which gathers the DC-area community. The last DCR Conference at Georgetown University went really well, as you can see in this recap. With the shift to virtual gatherings brought on by the pandemic, our community has gone fully remote, including the monthly Open Statistical Programming Meetup. With that, we realized the DCR Conference didn’t just need to be for folks from the DC area anymore; instead, we could welcome a global audience like we did with this year’s NYR. And that gave birth to R|Gov, the Government and Public Sector R Conference.

R|Gov is really a new industry-focused conference. Instead of drawing on speakers from a particular city or area, the talks will focus on work done in specific fields. In this case, in government, defense, NGOs and the public sector, and we have speakers from not only the DC-area, but also from Geneva, Switzerland, Nashville, Tennessee, Quebec, Canada and Los Angeles, California. For the last three years, we have been working with Data Community DC, R-Ladies DC, and the Statistical Programming DC Meetup, to put on DCR, and continue to do so for R|Gov as we find great speakers and organizations who want to collaborate in driving attendance and building the community.

Like NYR and DCR, the topics at R|Gov range from practical how-tos, to theoretical findings, to processes, to tooling, and the speakers this year come from the Center for Army Analysis, NASA, Columbia University, The U.S. Bureau of Labor Statistics, the Inter-American Development Bank, The United States Census Bureau, Harvard Business School, In-Q-Tel, Virginia Tech, Deloitte, NYC Department of Health and Mental Hygiene and Georgetown University, among others. We will also be hosting two rum and gin master classes, including one with Mount Gay, which comes from the oldest continuously running rum distillery in the world, and which George Washington served at his inauguration!

The R Conference series is quite a bit different from other industry and academic conferences. The talks are twenty minutes long with no audience questions, with the exception of special talks from the likes of Andrew Gelman or Hadley Wickham. Whether in person or virtual, we play music, have prize giveaways and involve food in the programming. When the events were in person, we prided ourselves on avocado toast, pizza, ice cream and beer. For prizewinners, we autographed books right on stage since the authors were either speakers or in the audience. With the virtual events we try to capture as much of that spirit as possible, and the community really enjoyed the virtual R Conference | NY in August. Lively both remotely and in the flesh, it was also one of the more informative conferences I have ever seen.

This virtual conference will include much of the in-person format, just recreated virtually. We will have 24 talks, a panel, workshops, community and networking breaks, happy hours, prizes and giveaways, a Twitter Contest, Meet the Speaker series, Job Board access, and participation in the Art Auction. We hope to see you there December 2-4, on a comfy couch near you.



The sixth annual (and first virtual) “New York” R Conference took place August 5-6 & 12-15. Almost 300 attendees, and 30 speakers, plus a stand-up comedian and a whiskey masterclass leader, gathered remotely to explore, share, and inspire ideas.

We had many awesome speakers, many new and some returning: Dr. Rob J Hyndman (Monash University), Dr. Adam Obeng (Facebook), Ludmila Janda (Amplify), Emily Robinson (Warby Parker), Daniel Chen (Virginia Tech, Lander Analytics), Dr. Jon Krohn (untapt), Dr. Andrew Gelman (Columbia University), David Smith (Microsoft), Laura Gabrysiak (Visa), Brooke Watson (ACLU), Dr. Sebastian Teran Hidalgo (Vroom), Catherine Zhou (Codecademy), Dr. Jacqueline Nolis (Brightloom), Sonia Ang (Microsoft), Emily Dodwell (AT&T Labs Research), Jonah Gabry (Columbia University, Stan Development Team), Wes McKinney and Dr. Neal Richardson (Ursa Labs), Dr. Thomas Mock (RStudio), Dr. David Robinson (Heap), Dr. Max Kuhn (RStudio), Dr. Erin LeDell (H2O.ai), Monica Thieu (Columbia University), Camelia Hssaine (Codecademy), and myself. Coming soon: a bonus talk by Heather Nolis (T-Mobile), which will be shared on YouTube along with all the other talks as soon as our team is done editing them.

Let’s take a look at some of the highlights from the conference:

Andrew Gelman Gave Another 40-Minute Talk (no slides, as always)

Our favorite quotes from Andrew Gelman’s talk, Truly Open Science: From Design and Data Collection to Analysis and Decision Making, which had no slides, as usual:

“Everyone training in statistics becomes a teacher.”

“The most important thing you should take away — put multiple graphs on a page.”

“Honesty and transparency are not enough.”

Laura Gabrysiak Shows us We Are Driven By Experience, and not Brand Loyalty…Hope you Folks had a Good Experience!

Laura’s talk on re-inventing customer engagement with machine learning went through several interesting use cases from her time at Visa. In addition to being a data scientist, she is an active community organizer and the co-founder of R-Ladies Miami.

One of my former students at Columbia University, Adam Obeng, gave a great presentation on adaptive experimentation. We learned that adaptive experimentation is three things: the name of (1) a family of techniques, (2) Adam’s team at Facebook, and (3) an open source package produced by said team. He went through the applications: hyperparameter optimization for machine learning, experimentation with multiple continuous treatments, and physical or manufacturing experiments.

Dr. Jacqueline Nolis Invited Us to Crash Her Viral Website, Tweet Mashup

Jacqueline asked the crowd to crash her viral website, Tweet Mashup, and gave a great talk on her experience building it back in 2016. The website lets you combine the tweets of two different people. After spending a year making it in .NET, she launched the site and it became an immediate sensation. Years later, she was getting more and more frustrated maintaining the F# code and decided to see if she could recreate it in Shiny. Doing so would require having Shiny integrate with the Twitter API in ways that hadn’t been done before, and pushing the Twitter API beyond normal use cases.

Attendees Participated in Two Virtual Happy Hours Packed with Fun

At the Friday Happy Hour, we had a mathematical standup comedian for the first time in R Conference history. Comic and math major Rachel Lander (no relationship to me!) entertained us with awesome math and stats jokes.

Following the stand up, we had a Whiskey Master Class with our Vibe Sponsor Westland Distillery, and another one on Saturday with Bruichladdich Distillery (hard to pronounce and easy to drink). Attendees and speakers learned and drank together, whether it be their whiskey, matchas, soda or water.

All Proceeds from the A(R)T Auction went to the R Foundation Again

A newer tradition, the A(R)T Auction, took place again! We featured pieces by artists in the R Community, and all proceeds were donated to the R Foundation. The highest-selling piece at auction was Street Cred (2020) by Vivian Peng (Lander Analytics and Los Angeles Mayor’s Office, Innovation Team). The second highest was a piece from R Conference speaker Jacqueline Nolis (Brightloom, and Build a Career in Data Science co-author), designed by Allison Horst, artist in residence at RStudio.

The R-Ladies Group Photo Happened, Even Remotely!

As per tradition, we took an R-Ladies group photo, but, for the first time, remotely, as a screenshot! We would like to note that many more R-Ladies were present in the chat, but chose not to share video.

Jon Harmon, Edna Mwenda, and Jessica Streeter win Raspberry Pis, Bluetooth Headphones, and Tenkeyless Keyboards for Most Active Tweeting During the Conference

This year’s Twitter Contest, in Malorie’s words, was a “ruthless but noble war.” You can see the NYR 2020 Dashboard here. A custom started at DCR 2018 by our Twitter scorekeeper Malorie Hughes (@data_all_day), it has returned every year by popular demand, and now she’s stuck with it forever! Congratulations to our winners!

50+ Conference Attendees Participated in Pre-Conference Workshops

For the first time ever, workshops took place over the course of several days to promote work-life balance, and to give attendees the chance to take more than one course. We ran the following seven workshops:

Recreating the In-Person Experience

We recreated as much of the in-person experience as possible with attendee networking sessions, the speaker walk-on songs and fun facts, abundant prizes and giveaways, the Twitter contest, an art auction, and happy hours. In addition to all of this, we mailed conference programs, hex stickers, and other swag to each attendee (in the U.S.), along with discount codes from our Vibe Sponsors, MatchaBar, Westland Distillery and Bruichladdich Distillery.

Thank you, Lander Analytics Team!

Even though it was virtual, there was a lot of work that went into the conference, and I want to thank my amazing team at Lander Analytics along with our producer, Bill Prickett, for making it all come together.

Looking Forward to D.C. and Dublin
If you attended, we hope you had an incredible experience. If you did not, we hope to see you at the virtual DC R Conference in the fall, and at the first Dublin R Conference and the NYR next year!



The Second Annual DCR Conference made its way to the ICC Auditorium at Georgetown University last week on November 8th and 9th. A sold-out crowd of R enthusiasts and data scientists gathered to explore, share and inspire ideas.

We had so many great speakers join us this year: David Robinson, Malorie Hughes, Stephanie Kirmer, Daniel Chen, Emily Robinson, Kelly O’Briant, Marck Vaisman, Elizabeth Sweeney, Brian Wright, Ami Gates, Selina Carter, Refael Lav, Thomas Jones, Abhijit Dasgupta, Angela Li, Alex Engler, BJ Bloom, Samantha Tyner, Tatyana Tsvetovat, Danya Murali, Ronald Cappellini, Jon Harmon, Kaelen Medeiros, Kimberly Kreiss and myself.

As always, the food was delicious! Our caterer even surprised us with Lander cookies.

David Robinson shared his Ten Tremendous Tricks in the Tidyverse. Always enthusiastic, DRob did a great job showing both well known and obscure functions for an easier data workflow.

Elizabeth Sweeney gave an awesome talk on Visualizing the Environmental Impact of Beef Consumption using Plotly and Shiny. We explored the impact of eating different cuts of beef in terms of the number of animal lives, Co2 emissions, water usage, and land usage. Did you know that there is a big difference in the environmental impact of consuming 100 pounds of hanger steak versus the same weight in ground beef? She used plotly to make interactive graphics and R Shiny to make an interactive webpage to explore the data.

The integrated development environment, RStudio, fully integrated themselves into the environment.

As a father, I’ve earned the right to make dad jokes (see above). You can see the slides for my talk, Raising Baby with R. While babies are commonly called bundles of joy, they are also bundles of data. Being the child of a data scientist and a neuroscientist, my son was certain to be analyzed in myriad ways. I discussed how we used data to narrow down possible names, then looked at using time series methods to analyze his sleeping and eating patterns. All in the name of science.

Malorie Hughes Analyzing Tweets Again

We also organized a Tweeting competition with the help of Malorie Hughes, our Twitter scorekeeper. Check out the DCR 2019 Twitter Dashboard with the Mash-Up Metric Details she created.

Our winner for Contribution & Engagement was Emily Robinson. Other notable winners included Kimberly Kreiss (see slides), Will Angel and Jon Harmon.

There was a glitch in the system and one of our own organizers and former Python user won a prize. We let her keep it, and now she has no excuse not to learn R.

Not to mention we had some great workshops on November 7th, preceding the conference:

Thanks to R-Ladies and Data Community D.C. for helping us spread the word.

Videos

The videos for the conference will be posted in the coming weeks to YouTube.com.

See You Next Year

Looking forward to more great conferences at next year’s NYR in the spring, Dublin R in the summer, and DCR again in the fall!

Hex Stickers

We went all out and ordered a few thousand hex stickers.

Speaker Slides

We don’t have them all yet but here are some to get started:



The costs involved with a health insurance plan can be confusing, so I perform an analysis of different options to find which plan is most cost effective.

My wife and I recently brought a new R programmer into our family so we had to update our health insurance. Becky is a researcher in neuroscience and psychology at NYU so we decided to choose an NYU insurance plan.

For families there are two main plans: Value and Advantage. The primary differences between the plans are the following:

| Item | Explanation | Value Plan Amount | Advantage Plan Amount |
|---|---|---|---|
| Bi-Weekly Premiums | The amount we pay every other week in order to have insurance | $160 ($4,160 annually) | $240 ($6,240 annually) |
| Deductible | Amount we pay directly to health providers before the insurance starts covering costs | $1,000 | $800 |
| Coinsurance | After the deductible is met, we pay this percentage of medical bills | 20% | 10% |
| Out-of-Pocket Maximum | This is the most we will have to pay to health providers in a year (premiums do not count toward this max) | $6,000 | $5,000 |

We put them into a tibble for use later.

# use tribble() to make a quick and dirty tibble
parameters <- tibble::tribble(
    ~Plan, ~Premiums, ~Deductible, ~Coinsurance, ~OOP_Maximum,
    'Value', 160*26, 1000, 0.2, 6000,
    'Advantage', 240*26, 800, 0.1, 5000
)

Other than these cost differences, there is not any particular benefit of either plan over the other. That means whichever plan is cheaper is the best to choose.

This blog post walks through the steps of evaluating the plans to figure out which to select. Code is included so anyone can repeat, and improve on, the analysis for their given situation.

Cost

In order to figure out which plan to select we need to figure out the all-in cost, which is a function of how much we spend on healthcare in a year (we have to estimate our annual spending) and the aforementioned premiums, deductible, coinsurance and out-of-pocket maximum.

$\text{cost} = f(\text{spend}; \text{premiums}, \text{deductible}, \text{coinsurance}, \text{oop\_maximum}) = \\ \min(\text{oop\_maximum}, \text{deductible} + \text{coinsurance}\times(\text{spend}-\text{deductible}))+\text{premiums}$

This can be written as an R function like this.

#' @title cost
#' @description Given healthcare spend and other parameters, calculate the actual cost to the user
#' @details Uses the formula above to calculate total costs given a certain level of spending. This is the premiums plus either the out-of-pocket maximum, the actual spend level if the deductible has not been met, or the amount of the deductible plus the coinsurance for spend above the deductible but below the out-of-pocket maximum.
#' @author Jared P. Lander
#' @param spend A given amount of healthcare spending as a vector for multiple amounts
#' @param deductible The deductible for a given plan
#' @param coinsurance The coinsurance percentage for spend beyond the deductible but below the out-of-pocket maximum
#' @param oop_maximum The maximum amount of money (not including premiums) that the insured will pay under a given plan
#' @return The total cost to the insured
#' @examples
#' cost(3000, 4160, 1000, .20, 6000)
#' cost(3000, 6240, 800, .10, 5000)
#'
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
    # spend is vectorized so we use pmin to get the min between oop_maximum and
    # (deductible + coinsurance*(spend - deductible)) for each value of spend provided
    pmin(
        # we can never pay more than oop_maximum so that is one side
        oop_maximum,
        # if we are under oop_maximum for a given amount of spend,
        # this is the cost
        pmin(spend, deductible) + coinsurance*pmax(spend - deductible, 0)
    ) +
        # and we always pay the premiums
        premiums
}

With this function we can see if one plan is always, or mostly, cheaper than the other plan and that’s the one we would choose.
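As a quick sanity check, here is a sketch calling the function at a few spend levels with the Value plan’s parameters from the table above (the cost() definition is repeated so the snippet runs on its own):

```r
# cost() as defined above, repeated so this snippet is self-contained
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
    pmin(oop_maximum,
         pmin(spend, deductible) + coinsurance*pmax(spend - deductible, 0)) +
        premiums
}

# Value plan: $4,160 annual premiums, $1,000 deductible,
# 20% coinsurance, $6,000 out-of-pocket maximum
cost(c(1000, 20000, 50000), premiums=4160, deductible=1000,
     coinsurance=0.2, oop_maximum=6000)
## [1]  5160  8960 10160
```

At $1,000 of spend we are still inside the deductible, at $20,000 the coinsurance is in effect, and at $50,000 the out-of-pocket maximum has kicked in.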

R Packages

For the rest of the code we need these R packages.

library(dplyr)
library(ggplot2)
library(tidyr)
library(formattable)
library(readr)
library(readxl)

Spending

To see our out-of-pocket cost at varying levels of healthcare spend we build a grid in $1,000 increments from $1,000 to $70,000.

spending <- tibble::tibble(Spend=seq(1000, 70000, by=1000))

We call our cost function on each amount of spend for the Value and Advantage plans.

spending <- spending %>%
    # use our function to calculate the cost for the Value plan
    mutate(Value=cost(
        spend=Spend,
        premiums=parameters$Premiums[1],
        deductible=parameters$Deductible[1],
        coinsurance=parameters$Coinsurance[1],
        oop_maximum=parameters$OOP_Maximum[1]
    )) %>%
    # use our function to calculate the cost for the Advantage plan
    mutate(Advantage=cost(
        spend=Spend,
        premiums=parameters$Premiums[2],
        deductible=parameters$Deductible[2],
        coinsurance=parameters$Coinsurance[2],
        oop_maximum=parameters$OOP_Maximum[2]
    )) %>%
    # compute the difference in costs for each plan
    mutate(Difference=Advantage-Value) %>%
    # the winner for a given amount of spend is the cheaper plan
    mutate(Winner=if_else(Advantage < Value, 'Advantage', 'Value'))

The results are in the following table, showing every other row to save space. The Spend column is a theoretical amount of spending with a red bar giving a visual sense for the increasing amounts. The Value and Advantage columns are the corresponding overall costs of the plans for the given amount of Spend. The Difference column is the result of Advantage - Value, where positive numbers in blue mean that the Value plan is cheaper while negative numbers in red mean that the Advantage plan is cheaper. This is further indicated in the Winner column, which has the corresponding colors.

| Spend | Value | Advantage | Difference | Winner |
|---|---|---|---|---|
| $2,000 | $5,360 | $7,160 | 1800 | Value |
| $4,000 | $5,760 | $7,360 | 1600 | Value |
| $6,000 | $6,160 | $7,560 | 1400 | Value |
| $8,000 | $6,560 | $7,760 | 1200 | Value |
| $10,000 | $6,960 | $7,960 | 1000 | Value |
| $12,000 | $7,360 | $8,160 | 800 | Value |
| $14,000 | $7,760 | $8,360 | 600 | Value |
| $16,000 | $8,160 | $8,560 | 400 | Value |
| $18,000 | $8,560 | $8,760 | 200 | Value |
| $20,000 | $8,960 | $8,960 | 0 | Value |
| $22,000 | $9,360 | $9,160 | -200 | Advantage |
| $24,000 | $9,760 | $9,360 | -400 | Advantage |
| $26,000 | $10,160 | $9,560 | -600 | Advantage |
| $28,000 | $10,160 | $9,760 | -400 | Advantage |
| $30,000 | $10,160 | $9,960 | -200 | Advantage |
| $32,000 | $10,160 | $10,160 | 0 | Value |
| $34,000 | $10,160 | $10,360 | 200 | Value |
| $36,000 | $10,160 | $10,560 | 400 | Value |
| $38,000 | $10,160 | $10,760 | 600 | Value |
| $40,000 | $10,160 | $10,960 | 800 | Value |
| $42,000 | $10,160 | $11,160 | 1000 | Value |
| $44,000 | $10,160 | $11,240 | 1080 | Value |
| $46,000 | $10,160 | $11,240 | 1080 | Value |
| $48,000 | $10,160 | $11,240 | 1080 | Value |
| $50,000 | $10,160 | $11,240 | 1080 | Value |
| $52,000 | $10,160 | $11,240 | 1080 | Value |
| $54,000 | $10,160 | $11,240 | 1080 | Value |
| $56,000 | $10,160 | $11,240 | 1080 | Value |
| $58,000 | $10,160 | $11,240 | 1080 | Value |
| $60,000 | $10,160 | $11,240 | 1080 | Value |
| $62,000 | $10,160 | $11,240 | 1080 | Value |
| $64,000 | $10,160 | $11,240 | 1080 | Value |
| $66,000 | $10,160 | $11,240 | 1080 | Value |
| $68,000 | $10,160 | $11,240 | 1080 | Value |
| $70,000 | $10,160 | $11,240 | 1080 | Value |

Of course, plotting often makes it easier to see what is happening.

spending %>%
# keep just the two cost columns so gather() does not sweep in Difference and Winner
select(Spend, Value, Advantage) %>%
# put the data in long format so ggplot can set the colors
gather(key=Plan, value=Cost, -Spend) %>%
ggplot(aes(x=Spend, y=Cost, color=Plan)) +
geom_line(size=1) +
scale_x_continuous(labels=scales::dollar) +
scale_y_continuous(labels=scales::dollar) +
scale_color_brewer(type='qual', palette='Set1') +
labs(x='Healthcare Spending', y='Out-of-Pocket Costs') +
theme(
legend.position='top',
axis.title=element_text(face='bold')
)

It looks like there is only a small window where the Advantage plan is cheaper than the Value plan. This will be more obvious if we draw a plot of the difference in cost.

spending %>%
ggplot(aes(x=Spend, y=Difference, color=Winner, group=1)) +
geom_hline(yintercept=0, linetype=2, color='grey50') +
geom_line(size=1) +
scale_x_continuous(labels=scales::dollar) +
scale_y_continuous(labels=scales::dollar) +
labs(
x='Healthcare Spending',
y='Difference in Out-of-Pocket Costs Between the Two Plans'
) +
scale_color_brewer(type='qual', palette='Set1') +
theme(
legend.position='top',
axis.title=element_text(face='bold')
)

To calculate the exact cutoff points where one plan becomes cheaper than the other plan we have to solve for where the two curves intersect. Due to the out-of-pocket maximums the curves are non-linear so we need to consider four cases.

1. The spending exceeds the point of maximum out-of-pocket spend for both plans
2. The spending does not exceed the point of maximum out-of-pocket spend for either plan
3. The spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan
4. The spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan

When the spending exceeds the point of maximum out-of-pocket spend for both plans the curves are parallel, so there will be no crossover point.

When the spending does not exceed the point of maximum out-of-pocket spend for either plan we set the cost calculations (not including the out-of-pocket maximum) for each plan equal to each other and solve for the amount of spend that creates the equality.

To keep the equations smaller we use variables such as $$d_v$$ for the Value plan deductible, $$c_a$$ for the Advantage plan coinsurance and $$oop_v$$ for the out-of-pocket maximum for the Value plan.

$d_v + c_v(S - d_v) + p_v = d_a + c_a(S - d_a) + p_a \\ c_v(S - d_v) - c_a(S - d_a) = d_a - d_v + p_a - p_v \\ c_vS - c_vd_v - c_aS + c_ad_a = d_a - d_v + p_a - p_v \\ S(c_v - c_a) = d_a - c_ad_a - d_v + c_vd_v + p_a - p_v \\ S(c_v - c_a) = d_a(1 - c_a) - d_v(1 - c_v) + p_a - p_v \\ S = \frac{d_a(1 - c_a) - d_v(1 - c_v) + p_a - p_v}{c_v - c_a}$
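Plugging our actual numbers into that last line (annual premiums of $4,160 and $6,240, deductibles of $1,000 and $800, coinsurance of 20% and 10%) gives the crossover point for this case:

```r
# crossover when neither plan has hit its out-of-pocket maximum:
# S = (d_a(1 - c_a) - d_v(1 - c_v) + p_a - p_v) / (c_v - c_a)
p_v <- 4160; d_v <- 1000; c_v <- 0.2   # Value plan
p_a <- 6240; d_a <- 800;  c_a <- 0.1   # Advantage plan

(d_a*(1 - c_a) - d_v*(1 - c_v) + p_a - p_v) / (c_v - c_a)
## [1] 20000
```

This matches the table above, where both plans cost $8,960 at $20,000 of spend.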

When the spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan, we set the out-of-pocket maximum plus premiums for the Value plan equal to the cost calculation of the Advantage plan.

$oop_v + p_v = d_a + c_a(S - d_a) + p_a \\ d_a + c_a(S - d_a) + p_a = oop_v + p_v \\ c_aS - c_ad_a = oop_v + p_v - p_a - d_a \\ c_aS = oop_v + p_v - p_a + c_ad_a - d_a \\ S = \frac{oop_v + p_v - p_a + c_ad_a - d_a}{c_a}$

When the spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan, the solution is just the opposite of the previous equation.

$oop_a + p_a = d_v + c_v(S - d_v) + p_v \\ d_v + c_v(S - d_v) + p_v = oop_a + p_a \\ c_vS - c_vd_v = oop_a + p_a - p_v - d_v \\ c_vS = oop_a + p_a - p_v + c_vd_v - d_v \\ S = \frac{oop_a + p_a - p_v + c_vd_v - d_v}{c_v}$

As an R function it looks like this.

#' @title calculate_crossover_points
#' @description Given healthcare parameters for two plans, calculate when one plan becomes more expensive than the other.
#' @details Calculates the potential crossover points for different scenarios and returns the ones that are true crossovers.
#' @author Jared P. Lander
#' @param premiums_1 The annual premiums for plan 1
#' @param deductible_1 The deductible for plan 1
#' @param coinsurance_1 The coinsurance percentage for spend beyond the deductible for plan 1
#' @param oop_maximum_1 The maximum amount of money (not including premiums) that the insured will pay under plan 1
#' @param premiums_2 The annual premiums for plan 2
#' @param deductible_2 The deductible for plan 2
#' @param coinsurance_2 The coinsurance percentage for spend beyond the deductible for plan 2
#' @param oop_maximum_2 The maximum amount of money (not including premiums) that the insured will pay under plan 2
#' @return The amount of spend at which point one plan becomes more expensive than the other
#' @examples
#' calculate_crossover_points(
#'     4160, 1000, 0.2, 6000,
#'     6240, 800, 0.1, 5000
#' )
#'
calculate_crossover_points <- function(
    premiums_1, deductible_1, coinsurance_1, oop_maximum_1,
    premiums_2, deductible_2, coinsurance_2, oop_maximum_2
)
{
    # calculate the crossover before either has maxed out
    neither_maxed_out <- (premiums_2 - premiums_1 +
                              deductible_2*(1 - coinsurance_2) -
                              deductible_1*(1 - coinsurance_1)) /
        (coinsurance_1 - coinsurance_2)

# calculate the crossover when one plan has maxed out but the other has not
one_maxed_out <- (oop_maximum_1 +
                      premiums_1 - premiums_2 +
                      coinsurance_2*deductible_2 -
                      deductible_2) /
    coinsurance_2

# calculate the crossover for the reverse
other_maxed_out <- (oop_maximum_2 +
                        premiums_2 - premiums_1 +
                        coinsurance_1*deductible_1 -
                        deductible_1) /
    coinsurance_1

# these are all possible points where the curves cross
all_roots <- c(neither_maxed_out, one_maxed_out, other_maxed_out)

# now calculate the difference between the two plans to ensure that these are true crossover points
all_differences <- cost(all_roots, premiums_1, deductible_1, coinsurance_1, oop_maximum_1) -
    cost(all_roots, premiums_2, deductible_2, coinsurance_2, oop_maximum_2)

# only when the difference between plans is 0 are the curves truly crossing
all_roots[all_differences == 0]
}

We then call the function with the parameters for both plans we are considering.

crossovers <- calculate_crossover_points(
parameters$Premiums[1], parameters$Deductible[1], parameters$Coinsurance[1], parameters$OOP_Maximum[1],
parameters$Premiums[2], parameters$Deductible[2], parameters$Coinsurance[2], parameters$OOP_Maximum[2]
)

crossovers
## [1] 20000 32000

We see that the Advantage plan is only cheaper than the Value plan when spending is between $20,000 and $32,000.

The next question is: will our healthcare spending fall in that narrow band between $20,000 and $32,000 where the Advantage plan is the cheaper option?

Probability of Spending

This part gets tricky. I’d like to figure out the probability of spending between $20,000 and $32,000. Unfortunately, it is not easy to find healthcare spending data due to the opaque healthcare system. So I am going to make a number of assumptions. This will likely violate a few principles, but it is better than nothing.

Assumptions and calculations:

• Healthcare spending follows a log-normal distribution
• We will work with New York State data which is possibly different than New York City data
• We know the mean for New York spending in 2014
• We will use the accompanying annual growth rate to estimate mean spending in 2019
• We have the national standard deviation for spending in 2009
• In order to figure out the standard deviation for New York, we calculate how different the New York mean is from the national mean as a multiple, then multiply the national standard deviation by that number to approximate the New York standard deviation in 2009
• We use the growth rate from before to estimate the New York standard deviation in 2019

First, we calculate the mean. The Centers for Medicare & Medicaid Services has data on total and per capita medical expenditures by state from 1991 to 2014 and includes the average annual percentage growth. Since the data are bundled in a zip with other files, I posted them on my site for easy access.

spend_data_url <- 'https://jaredlander.com/data/healthcare_spending_per_capita_1991_2014.csv'
health_spend <- read_csv(spend_data_url)

We then take just New York spending for 2014 and multiply it by the corresponding growth rate.

ny_spend <- health_spend %>%
# get just New York
filter(State_Name == 'New York') %>%
# this row holds overall spending information
filter(Item == 'Personal Health Care ($)') %>%
# we only need a few columns
select(Y2014, Growth=Average_Annual_Percent_Growth) %>%
# we have to calculate the spending for 2019 by accounting for growth
# after converting it to a percentage
mutate(Y2019=Y2014*(1 + (Growth/100))^5)

ny_spend
Y2014 Growth    Y2019
 9778      5 12479.48

The standard deviation is trickier. The best I can find was the standard deviation on the national level in 2009. In 2013 the Centers for Medicare & Medicaid Services wrote, in Volume 3, Number 4 of Medicare & Medicaid Research Review, an article titled Modeling Per Capita State Health Expenditure Variation: State-Level Characteristics Matter. Exhibit 2 shows that the standard deviation of healthcare spending was $1,241 for the entire country in 2009. We need to estimate the New York standard deviation from this and then account for growth into 2019.

Next, we figure out the difference between the New York State spending mean and the national mean as a multiple.

nation_spend <- health_spend %>%
filter(Item == 'Personal Health Care ($)') %>%
filter(Region_Name == 'United States') %>%
pull(Y2009)

ny_multiple <- ny_spend$Y2014/nation_spend

ny_multiple
## [1] 1.418746

We see that the New York average is 1.4187464 times the national average. So we multiply the national standard deviation from 2009 by this amount to estimate the New York State standard deviation, and we assume the same annual growth rate as for the mean. Recall that multiplying a random variable by a constant scales its standard deviation by that constant.

\begin{align} \text{var}(x*c) &= c^2*\text{var}(x) \\ \text{sd}(x*c) &= c*\text{sd}(x) \end{align}
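As a quick numeric check of this property, scaling a sample by a constant scales its sample standard deviation by exactly that same constant (the sample values and the constant below are arbitrary):

```r
set.seed(42)
# any sample will do; the relationship holds exactly, not just in expectation
x <- rnorm(1000, mean=9778, sd=1241)
# ratio of the scaled SD to the original SD recovers the constant
sd(1.42 * x) / sd(x)
## [1] 1.42
```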

ny_spend <- ny_spend %>%
mutate(SD2019=1241*ny_multiple*(1 + (Growth/100))^10)

ny_spend
Y2014 Growth Y2019 SD2019
9778 5 12479.48 2867.937

My original assumption was that spending would follow a normal distribution, but New York’s resident agricultural economist, JD Long, suggested that the spending distribution would have a floor at zero (a person cannot spend a negative amount) and a long right tail (there will be many people with lower levels of spending and a few people with very high levels of spending), so a log-normal distribution seems more appropriate.

$\text{spending} \sim \text{lognormal}(\text{log}(12479), \text{log}(2868)^2)$

Visualized it looks like this.

draws <- tibble(
Value=rlnorm(
n=1200,
meanlog=log(ny_spend$Y2019),
sdlog=log(ny_spend$SD2019)
)
)

ggplot(draws, aes(x=Value)) + geom_density() + xlim(0, 75000)

We can see the very long right tail: most of the mass is at lower spending levels, while a few people have very high spending.

Then the probability of spending between $20,000 and $32,000 can be calculated with plnorm().

plnorm(crossovers[2], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)) -
plnorm(crossovers[1], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019))
## [1] 0.02345586

So we only have a 2.35% probability of our spending falling in that band where the Advantage plan is more cost effective. Meaning we have a 97.65% probability that the Value plan will cost less over the course of a year.
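As a sanity check on the closed-form answer, we can approximate the same probability by simulation: draw a large sample from the fitted log-normal and count the share of draws landing in the band. The numbers below are the fitted 2019 mean and standard deviation from above, plugged in directly so the snippet stands alone.

```r
set.seed(27)
# draw a large sample from the fitted log-normal distribution
sims <- rlnorm(1e6, meanlog=log(12479.48), sdlog=log(2867.937))
# proportion of simulated spending falling between $20,000 and $32,000;
# this should land very close to the plnorm() answer of about 0.023
mean(sims >= 20000 & sims <= 32000)
```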

We can also calculate the expected cost under each plan. We do this by first calculating the probability of spending each (thousand) dollar amount (since the log-normal is a continuous distribution this is an estimated probability). We then multiply each of those probabilities by its corresponding cost. Since the distribution is log-normal we need to exponentiate the resulting number. The data are on the thousands scale, so we multiply by 1,000 to put the result back on the dollar scale. Mathematically it looks like this.

$\mathbb{E}_{\text{Value}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Value}} \right\} \\ \mathbb{E}_{\text{Advantage}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Advantage}} \right\}$

The following code calculates the expected cost for each plan.

spending %>%
# calculate the point-wise estimated probabilities of the healthcare spending
# based on a log-normal distribution with the appropriate mean and standard deviation
mutate(
SpendProbability=dlnorm(
Spend,
meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)
)
) %>%
# compute the expected cost for each plan
# and the difference between them
summarize(
ValueExpectedCost=sum(Value*SpendProbability),
AdvantageExpectedCost=sum(Advantage*SpendProbability),
ExpectedDifference=sum(Difference*SpendProbability)
) %>%
# exponentiate the numbers so they are on the original scale
mutate(across(everything(), exp)) %>%
# the spending data is in increments of 1000
# so multiply by 1000 to get them on the dollar scale
mutate(across(everything(), ~ .x * 1000))

ValueExpectedCost AdvantageExpectedCost ExpectedDifference
         5422.768              7179.485           1323.952

This shows that overall the Value plan is cheaper by about $1,324 on average.

Conclusion

We see that there is a very small window of healthcare spending where the Advantage plan would be cheaper, and at most it would be about $600 cheaper than the Value plan. Further, the probability of falling in that small window of savings is just 2.35%.

So unless our spending will be between $20,000 and $32,000, which it likely will not be, the Value plan is the better choice.

Since the Value plan is so likely to be cheaper than the Advantage plan, I wondered who would pick the Advantage plan. Economist Jon Hersh invokes behavioral economics to explain why people may select the Advantage plan. Some components of the Advantage plan, such as the deductible, coinsurance and out-of-pocket maximum, are lower than in the Value plan. People see that under certain circumstances the Advantage plan would save them money and are enticed by that, not realizing how unlikely those circumstances are. In effect, they are hedging against a low-probability situation. (A consideration I have not accounted for is family size. The number of members in a family can have a big impact on the overall spend and whether or not it falls into the narrow band where the Advantage plan is cheaper.)

In the end, the Value plan is very likely going to be cheaper than the Advantage plan.

Try it at Home

I created a Shiny app to let users plug in the numbers for their own plans. It is rudimentary, but it gives a sense of the relative costs of different plans.

Thanks

A big thanks to Jon Hersh, JD Long, Kaz Sakamoto, Rebecca Martin and Adam Hogan for reviewing this post.

Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences, and author of R for Everyone.