Targets Workflow

My R-using family has a very old house with uneven heating and cooling. We are going to replace the HVAC system soon, and rather than just install a new system I wanted to make data-driven decisions about the replacement. Since I have a couple of ecobee thermostats, which have remote sensors, I figured there must be an API I could use to track the temperature in various rooms of the house.

A quick search turned up an API that was even better than I had hoped. My initial plan was to call the API every five minutes and write the data to a database, either self-hosted or managed by DigitalOcean. That would require my code to never miss a run, because each call would only capture the status at that moment, and there would be no way to go back in time to correct things. Fortunately, the API can also return historic data for any period of time in five-minute increments, which is much more useful.

In this post I will go through the process of calling the API, storing the data and building a workflow using {targets}. In the next post I’ll cover automating the process using GitHub Actions. In the third post I’ll show how I analyzed the data using time series methods.

Accessing the API

Before we can use the API we need to sign up for a key, and have an ecobee thermostat, of course. Getting the API key is explained here, though last I checked we cannot complete the process with two-factor authentication (2FA) turned on, so we should disable 2FA, register for the API, then re-enable 2FA.

Using the API requires the API key and an access token, but the access token only lasts for one hour. So we are given a refresh token, which is good for a year, and can be used to get a new access token. There is probably a better way to build the URL for {httr}, but this worked. Both the refresh token and API key will be strings of letters, numbers and symbols along the lines of ha786234h1q763h.

token_request <- httr::POST(
  # it uses the token endpoint with parameters in the URL
  url="https://api.ecobee.com/token?grant_type=refresh_token&code=ha786234h1q763h&client_id=kjdf837hw7384", 
  encode='json'
)

access_token <- httr::content(token_request)$access_token

Next, we use the API to get information about our thermostats, particularly the IDs and the current times. The access token is used for bearer authentication, which is added as a header to the httr::GET() call. All API requests from now on will need this access token.

thermostat_info <- httr::GET(
  # the requesting URL
  # the request is to the thermostat endpoint with parameters passed in a JSON body that is part of the URL
  # location is needed for the timezone offset
  'https://api.ecobee.com/1/thermostat?format=json&body={"selection":{"selectionType":"registered","selectionMatch":"","includeLocation":true}}'
  # authentication header
  # supplying a header with "Bearer access_token" tells the API who we are
  , httr::add_headers(Authorization=glue::glue('Bearer {access_token}'))
  # json is the chosen format
  , encode='json'
) %>% 
  # extract the content into a list
  httr::content()

From this object we can get the thermostat IDs which will be useful for generating the report.

thermostat_ids <- thermostat_info$thermostatList %>% 
  purrr::map_chr('identifier') %>% 
  # make it a single-element character vector that separates IDs with a comma
  paste(collapse=',')

thermostat_ids
## [1] "28716,17611"

Given thermostat IDs we can request a report for a given time period. The report requires a startDate and endDate. By default the returned data run from midnight to 11:55 PM UTC, so startInterval and endInterval shift the times to the appropriate time zone, which can be ascertained from thermostat_info$thermostatList[[1]]$location$timeZoneOffsetMinutes.

The first thing to note is that the intervals are in groups of five minutes, so in a 24-hour day there are 288 intervals. We account for being east or west of UTC by starting from 287 for positive offsets and 0 for negative offsets, then subtracting a fifth of the offset (the offset is in minutes and each interval is five minutes). The endInterval is just one less than the startInterval in order to completely wrap around the clock. Doing these calculations should get us the entire day in our time zone.

timeOffset <- thermostat_info$thermostatList[[1]]$location$timeZoneOffsetMinutes
start_interval <- 287*(timeOffset > 0) - timeOffset/5
end_interval <- start_interval - 1
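
To make the arithmetic concrete, here is the calculation for a hypothetical thermostat on US Eastern Standard Time, where the offset would be -300 minutes (the specific offset is my assumption for illustration):

```r
# hypothetical offset: US Eastern Standard Time is 300 minutes behind UTC
timeOffset <- -300

# (timeOffset > 0) is FALSE, so it contributes 0,
# leaving just the negated fifth of the (negative) offset
start_interval <- 287*(timeOffset > 0) - timeOffset/5
end_interval <- start_interval - 1

start_interval # 60, i.e. 05:00 UTC, which is midnight Eastern
end_interval   # 59, i.e. 04:55 UTC, wrapping around the clock
```

For a thermostat east of UTC the offset would be positive, so the calculation would instead start from 287 and subtract.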

Now we build the request URL for the report. The body in the URL takes a number of parameters. We already have startDate, endDate, startInterval and endInterval. columns is a comma separated listing of the desired data points. For our purposes we want "zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy". The selection parameter takes two arguments: selectionType, which should be "thermostats" and selectionMatch, which is the comma-separated listing of thermostat IDs saved in thermostat_ids. We want data from the room sensors so we set includeSensors to true. Like in previous requests, this URL is built using glue::glue().

report <- httr::GET(
  glue::glue(
    # the request is to the runtimeReport endpoint with parameters passed in a JSON body that is part of the URL
    'https://api.ecobee.com/1/runtimeReport?format=json&body={{"startDate":"{startDate}","endDate":"{endDate}","startInterval":{startInterval},"endInterval":{endInterval},"columns":"zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy","selection":{{"selectionType":"thermostats","selectionMatch":"{thermostats}"}},"includeSensors":true}}'
  )
  # authentication
  , httr::add_headers(Authorization=glue::glue('Bearer {access_token}'))
  , encode='json'
) %>% 
  httr::content()

Handling the Data

Now that we have the report object we need to turn it into a nice data.frame or tibble. There are two major components to the data: thermostat information and sensor information. Multiple sensors are associated with a thermostat, including a sensor in the thermostat itself. So if our house has two HVAC zones, and hence two thermostats, we would have two sets of this information. The thermostat information includes overall data such as date, time, average temperature (average reading for all the sensors associated with a thermostat), HVACMode (heat or cool), fan speed and outdoor temperature. The sensor information has readings from each sensor associated with each thermostat such as detected occupancy and temperature.

The thermostat level and sensor level data are kept in lists inside the report object and need to be handled separately.

Thermostat Level Data

The report object has an element called reportList, which is a list where each element represents a different thermostat. For a house with two thermostats this list will have a length of two. Each of these elements contains a list called rowList. This has as many elements as the number of intervals requested, 288 for a full day (five-minute intervals for 24 hours). All of these elements are character vectors of length one, with commas separating the values. A few examples are below.

report$reportList[[1]]$rowList
## [[1]]
## [1] "2021-01-25,00:00:00,70.5,heat,75,28.4,43,5,0,Sleep,71,71,heatOff,0"
## 
## [[2]]
## [1] "2021-01-25,00:05:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## 
## [[3]]
## [1] "2021-01-25,00:10:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## 
## [[4]]
## [1] "2021-01-25,00:15:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## 
## [[5]]
## [1] "2021-01-25,00:20:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"

We can combine all of these into a single vector with unlist().

unlist(report$reportList[[1]]$rowList)
## [1] "2021-01-25,00:00:00,70.5,heat,75,28.4,43,5,0,Sleep,71,71,heatOff,0"      
## [2] "2021-01-25,00:05:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [3] "2021-01-25,00:10:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [4] "2021-01-25,00:15:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [5] "2021-01-25,00:20:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"

The column names are stored in the columns element of the report object.

report$columns
## [1] "zoneAveTemp,HVACmode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHVACmode,zoneOccupancy"

This can be split into a vector of names using strsplit().

strsplit(report$columns, split=',')[[1]]
##  [1] "zoneAveTemp"     "HVACmode"        "fan"             "outdoorTemp"    
##  [5] "outdoorHumidity" "sky"             "wind"            "zoneClimate"    
##  [9] "zoneCoolTemp"    "zoneHeatTemp"    "zoneHVACmode"    "zoneOccupancy"

Perhaps a lesser known feature of readr::read_csv() is that rather than a file, it can read a character vector where each element has comma-separated values and return a tibble.

library(magrittr)
library(readr)
report$reportList[[1]]$rowList %>% 
  unlist() %>% 
  # add date and time to the specified column names
  read_csv(col_names=c('date', 'time', strsplit(report$columns, split=',')[[1]]))
date time zoneAveTemp HVACmode fan outdoorTemp outdoorHumidity sky wind zoneClimate zoneCoolTemp zoneHeatTemp zoneHVACmode zoneOccupancy
2021-01-25 00:00:00 70.5 heat 75 28.4 43 5 0 Sleep 71 71 heatOff 0
2021-01-25 00:05:00 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
2021-01-25 00:10:00 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
2021-01-25 00:15:00 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
2021-01-25 00:20:00 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0

We repeat that for each element in report$reportList and we have a nice tibble with the data we need.

library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:magrittr':
## 
##     set_names
# using the thermostat IDs as names lets them be identified in the tibble
names(report$reportList) <- purrr::map_chr(report$reportList, 'thermostatIdentifier')
central_thermostat_info <- report$reportList %>% 
  map_df(
    ~ read_csv(unlist(.x$rowList), col_names=c('date', 'time', strsplit(report$columns, split=',')[[1]])), 
    .id='Thermostat'
  )
central_thermostat_info
Thermostat date time zoneAveTemp HVACmode fan outdoorTemp outdoorHumidity sky wind zoneClimate zoneCoolTemp zoneHeatTemp zoneHVACmode zoneOccupancy
28716 2021-01-25 00:00:00 70.5 heat 75 28.4 43 5 0 Sleep 71 71 heatOff 0
28716 2021-01-25 00:05:00 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
28716 2021-01-25 00:10:00 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
28716 2021-01-25 00:15:00 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
28716 2021-01-25 00:20:00 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
17611 2021-01-25 00:00:00 64.4 heat 135 28.4 43 5 0 Sleep 78 64 heatOff 0
17611 2021-01-25 00:05:00 64.2 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
17611 2021-01-25 00:10:00 64.0 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
17611 2021-01-25 00:15:00 63.9 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
17611 2021-01-25 00:20:00 63.7 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0

Sensor Level Data

Due to the way the sensor data are stored, they are more difficult to extract than the thermostat data.

First, we get the column names for the sensors. The problem is that these vary depending on how many sensors are associated with each thermostat and on the type of sensor. As of now there are three types: the thermostat itself, which measures occupancy, temperature and humidity, and two kinds of remote sensors, both of which measure occupancy and temperature.

To do this we write two functions: one relates a sensor ID to a sensor name, and the other joins the result of the first function to the column names listed in the columns element of each element in the sensorList part of the report.

relate_sensor_id_to_name <- function(sensorInfo)
{
  purrr::map_df(
    sensorInfo, 
    ~tibble::tibble(ID=.x$sensorId, name=glue::glue('{.x$sensorName}_{.x$sensorType}'))
  )
}

# see how it works
relate_sensor_id_to_name(report$sensorList[[1]]$sensors)
ID name
rs:100:2 Bedroom 1_occupancy
rs:101:1 Master Bedroom_temperature
rs:101:2 Master Bedroom_occupancy
rs2:100:1 Office 1_temperature
rs2:100:2 Office 1_occupancy
rs2:101:1 Office 2_temperature
rs:100:1 Bedroom 1_temperature
rs2:101:2 Office 2_occupancy
ei:0:1 Thermostat Temperature_temperature
ei:0:2 Thermostat Humidity_humidity
ei:0:3 Thermostat Motion_occupancy

make_sensor_column_names <- function(sensorInfo)
{
  sensorInfo$columns %>% 
    unlist() %>% 
    tibble::enframe(name='index', value='id') %>% 
    dplyr::left_join(relate_sensor_id_to_name(sensorInfo$sensors), by=c('id'='ID')) %>% 
    dplyr::mutate(name=as.character(name)) %>% 
    dplyr::mutate(name=dplyr::if_else(is.na(name), id, name))
}

# see how it works
make_sensor_column_names(report$sensorList[[1]])
index id name
1 date date
2 time time
3 rs:100:2 Bedroom 1_occupancy
4 rs:101:1 Master Bedroom_temperature
5 rs:101:2 Master Bedroom_occupancy
6 rs2:100:1 Office 1_temperature
7 rs2:100:2 Office 1_occupancy
8 rs2:101:1 Office 2_temperature
9 rs:100:1 Bedroom 1_temperature
10 rs2:101:2 Office 2_occupancy
11 ei:0:1 Thermostat Temperature_temperature
12 ei:0:2 Thermostat Humidity_humidity
13 ei:0:3 Thermostat Motion_occupancy

Then for a set of sensors we can read the data from the data element using the read_csv() trick we saw earlier. Some manipulation is needed: we pivot the data longer so rows can be removed based on a condition, break apart a column using tidyr::separate(), make some changes with dplyr::mutate(), then pivot wider. This results in a tibble where each row represents the occupancy and temperature readings for a particular sensor at a given five-minute increment.

extract_one_sensor_info <- function(sensor)
{
  sensor_col_names <- make_sensor_column_names(sensor)$name
  sensor$data %>% 
    unlist() %>% 
    readr::read_csv(col_names=sensor_col_names) %>% 
    # make it longer so we can easily remove rows based on a condition
    tidyr::pivot_longer(cols=c(-date, -time), names_to='Sensor', values_to='Reading') %>% 
    # we use slice because grep() returns a vector of indices, not TRUE/FALSE
    dplyr::slice(grep(pattern='_(temperature)|(occupancy)$', x=Sensor, ignore.case=FALSE)) %>% 
    # split apart the sensor name from what it's measuring
    tidyr::separate(col=Sensor, into=c('Sensor', 'Measure'), sep='_', remove=TRUE) %>% 
    # rename the actual thermostats to say thermostat
    dplyr::mutate(Sensor=sub(pattern='Thermostat .+$', replacement='Thermostat', x=Sensor)) %>% 
    # back into wide format so each sensor is its own column
    tidyr::pivot_wider(names_from=Measure, values_from=Reading)
}

# see how it works
extract_one_sensor_info(report$sensorList[[1]])
date time Sensor occupancy temperature
2021-01-25 00:00:00 Bedroom 1 0 69.5
2021-01-25 00:00:00 Master Bedroom 1 71.6
2021-01-25 00:00:00 Office 1 1 72.7
2021-01-25 00:00:00 Office 2 0 74.9
2021-01-25 00:00:00 Thermostat 0 65.8
2021-01-25 00:05:00 Bedroom 1 0 69.3
2021-01-25 00:05:00 Master Bedroom 1 71.3
2021-01-25 00:05:00 Office 1 0 72.5
2021-01-25 00:05:00 Office 2 0 74.8
2021-01-25 00:05:00 Thermostat 0 65.7
2021-01-25 00:10:00 Bedroom 1 0 69.2
2021-01-25 00:10:00 Master Bedroom 0 71.2
2021-01-25 00:10:00 Office 1 0 72.5
2021-01-25 00:10:00 Office 2 0 74.8
2021-01-25 00:10:00 Thermostat 0 65.6
2021-01-25 00:15:00 Bedroom 1 0 69.2
2021-01-25 00:15:00 Master Bedroom 1 71.2
2021-01-25 00:15:00 Office 1 0 72.5
2021-01-25 00:15:00 Office 2 0 74.8
2021-01-25 00:15:00 Thermostat 0 65.5
2021-01-25 00:20:00 Bedroom 1 0 69.4
2021-01-25 00:20:00 Master Bedroom 1 71.2
2021-01-25 00:20:00 Office 1 0 72.7
2021-01-25 00:20:00 Office 2 0 74.9
2021-01-25 00:20:00 Thermostat 0 65.4

Then we put it all together and iterate over the sets of sensors attached to each thermostat.

extract_sensor_info <- function(report)
{
  names(report$sensorList) <- purrr::map_chr(report$reportList, 'thermostatIdentifier')
  purrr::map_df(report$sensorList, extract_one_sensor_info, .id='Thermostat')
}

We now save all of the sensor information in sensor_info.

sensor_info <- extract_sensor_info(report)

Combining Sensor and Thermostat Data

To nicely align the thermostat overall readings and settings with the individual sensor readings, we join the two together using dplyr::inner_join() with the Thermostat, date and time columns as keys. We also want to join in the nice thermostat names based on the thermostat IDs, so that tibble needs to be created first.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
thermostat_names <- tibble(
  ID=map_chr(thermostat_info$thermostatList, 'identifier')
  , Name=map_chr(thermostat_info$thermostatList, 'name')
)

all_info <- inner_join(x=central_thermostat_info, y=sensor_info, by=c('Thermostat', 'date', 'time')) %>% 
  left_join(thermostat_names, by=c('Thermostat'='ID')) %>% 
  mutate(Sensor=if_else(Sensor=='Thermostat', glue::glue('{Name} Thermostat'), Sensor)) %>% 
  relocate(Name, Sensor, date, time, temperature, occupancy)

# see it
all_info
Name Sensor date time temperature occupancy Thermostat zoneAveTemp HVACmode fan outdoorTemp outdoorHumidity sky wind zoneClimate zoneCoolTemp zoneHeatTemp zoneHVACmode zoneOccupancy
Upstairs Bedroom 1 2021-01-25 00:00:00 69.5 0 28716 70.5 heat 75 28.4 43 5 0 Sleep 71 71 heatOff 0
Upstairs Master Bedroom 2021-01-25 00:00:00 71.6 1 28716 70.5 heat 75 28.4 43 5 0 Sleep 71 71 heatOff 0
Upstairs Office 1 2021-01-25 00:00:00 72.7 1 28716 70.5 heat 75 28.4 43 5 0 Sleep 71 71 heatOff 0
Upstairs Office 2 2021-01-25 00:00:00 74.9 0 28716 70.5 heat 75 28.4 43 5 0 Sleep 71 71 heatOff 0
Upstairs Upstairs Thermostat 2021-01-25 00:00:00 65.8 0 28716 70.5 heat 75 28.4 43 5 0 Sleep 71 71 heatOff 0
Upstairs Bedroom 1 2021-01-25 00:05:00 69.3 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Master Bedroom 2021-01-25 00:05:00 71.3 1 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 1 2021-01-25 00:05:00 72.5 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 2 2021-01-25 00:05:00 74.8 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Upstairs Thermostat 2021-01-25 00:05:00 65.7 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Bedroom 1 2021-01-25 00:10:00 69.2 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Master Bedroom 2021-01-25 00:10:00 71.2 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 1 2021-01-25 00:10:00 72.5 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 2 2021-01-25 00:10:00 74.8 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Upstairs Thermostat 2021-01-25 00:10:00 65.6 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Bedroom 1 2021-01-25 00:15:00 69.2 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Master Bedroom 2021-01-25 00:15:00 71.2 1 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 1 2021-01-25 00:15:00 72.5 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 2 2021-01-25 00:15:00 74.8 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Upstairs Thermostat 2021-01-25 00:15:00 65.5 0 28716 70.2 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Bedroom 1 2021-01-25 00:20:00 69.4 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Master Bedroom 2021-01-25 00:20:00 71.2 1 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 1 2021-01-25 00:20:00 72.7 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Office 2 2021-01-25 00:20:00 74.9 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Upstairs Upstairs Thermostat 2021-01-25 00:20:00 65.4 0 28716 70.3 heat 300 28.4 43 5 0 Sleep 71 71 heatStage1On 0
Downstairs Living Room 2021-01-25 00:00:00 64.4 0 17611 64.4 heat 135 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Playroom 2021-01-25 00:00:00 63.7 0 17611 64.4 heat 135 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Kitchen 2021-01-25 00:00:00 66.3 0 17611 64.4 heat 135 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Downstairs Thermostat 2021-01-25 00:00:00 65.1 0 17611 64.4 heat 135 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Living Room 2021-01-25 00:05:00 64.2 0 17611 64.2 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Playroom 2021-01-25 00:05:00 63.4 0 17611 64.2 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Kitchen 2021-01-25 00:05:00 65.8 0 17611 64.2 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Downstairs Thermostat 2021-01-25 00:05:00 64.8 0 17611 64.2 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Living Room 2021-01-25 00:10:00 63.9 0 17611 64.0 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Playroom 2021-01-25 00:10:00 63.2 0 17611 64.0 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Kitchen 2021-01-25 00:10:00 65.7 0 17611 64.0 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Downstairs Thermostat 2021-01-25 00:10:00 64.4 0 17611 64.0 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Living Room 2021-01-25 00:15:00 63.7 0 17611 63.9 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Playroom 2021-01-25 00:15:00 62.9 0 17611 63.9 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Kitchen 2021-01-25 00:15:00 65.6 0 17611 63.9 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Downstairs Thermostat 2021-01-25 00:15:00 64.1 0 17611 63.9 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Living Room 2021-01-25 00:20:00 63.4 0 17611 63.7 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Playroom 2021-01-25 00:20:00 62.6 0 17611 63.7 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Kitchen 2021-01-25 00:20:00 65.4 0 17611 63.7 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0
Downstairs Downstairs Thermostat 2021-01-25 00:20:00 63.9 0 17611 63.7 heat 0 28.4 43 5 0 Sleep 78 64 heatOff 0

We now have a nice tibble in long format where each row shows a number of measurements and setting for a sensor for a given period of time.

Saving Data in the Cloud

Now that we have the data, we need to save it somewhere. Since these data should be stored somewhere resilient, cloud storage seems like the best option. There are many services, including Azure, AWS, BackBlaze and GCP, but I prefer DigitalOcean Spaces since it has an easy interface and supports the standard S3 APIs.

DigitalOcean uses slightly different terminology than AWS, in that buckets are called spaces. Otherwise we still need to deal with Access Key IDs, Secret Access Keys, Regions and Endpoints. A good tutorial on creating DigitalOcean spaces comes directly from them. In fact, they have so many great tutorials about Linux in general that I highly recommend them for learning about computing.

While there is a DigitalOcean package, aptly named {analogsea}, I did not have luck getting it to work, so instead I used {aws.s3}.

After writing the all_info tibble to a CSV, we put it to the DigitalOcean bucket using aws.s3::put_object(). This requires three arguments:

  • file: The name of the file on our computer.
  • object: The path including filename inside the S3 bucket where the file will be saved.
  • bucket: The name of the S3 bucket, or in our case the space, since DigitalOcean calls buckets spaces.

Implicitly, put_object() and most of the functions in {aws.s3} depend on certain environment variables being set. The first two are AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which correspond to the DigitalOcean Access Key and Secret Key, respectively. These can be generated at https://cloud.digitalocean.com/account/api/tokens. The Access Key can be retrieved at any time, but the Secret Key is only displayed once, so we must save it somewhere. Just like AWS, DigitalOcean has regions, and this value is saved in the AWS_DEFAULT_REGION environment variable; in this case the region is nyc3. Then we need the AWS_S3_ENDPOINT environment variable. The DigitalOcean documentation makes it seem like this should be nyc3.digitaloceanspaces.com (or the appropriate region) but for our purposes it should just be digitaloceanspaces.com.

Perhaps the most common way to set environment variables in R is to save them in the VARIABLE_NAME=value format in the .Renviron file, either in our home directory or project directory. But we must be sure NOT to track this file with git, lest we publicly expose our secret information.
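
For reference, a .Renviron file for this project might look like the following, where every value shown is a placeholder rather than a real credential:

```
AWS_ACCESS_KEY_ID=EXAMPLEACCESSKEY
AWS_SECRET_ACCESS_KEY=ExampleSecretKey123
AWS_DEFAULT_REGION=nyc3
AWS_S3_ENDPOINT=digitaloceanspaces.com
ECOBEE_REFRESH_TOKEN=ha786234h1q763h
ECOBEE_API_KEY=kjdf837hw7384
```

Adding .Renviron to .gitignore keeps these values out of version control; the ECOBEE_ variables can then be read with Sys.getenv() wherever the refresh token and API key are needed.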

Then we can call the put_object() function.

# we make a file each day so it is important to include the date in the name.
filename <- 'all_info_2021-02-15.csv'
aws.s3::put_object(
  file=filename, object=sprintf('%s/%s', 'do_folder_name', filename), bucket='do_bucket_name'
)

After this, the file lives in a DigitalOcean space to be accessed later.
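
To check that the upload worked, or to pull the file back down later, {aws.s3} provides helpers that rely on the same environment variables. A quick sketch, assuming the same placeholder folder and bucket names as above:

```r
filename <- 'all_info_2021-02-15.csv'

# confirm the object landed in the space
aws.s3::object_exists(
  object=sprintf('%s/%s', 'do_folder_name', filename), bucket='do_bucket_name'
)

# read the CSV straight back into a tibble without saving it locally first
all_info_check <- aws.s3::s3read_using(
  FUN=readr::read_csv,
  object=sprintf('%s/%s', 'do_folder_name', filename), bucket='do_bucket_name'
)
```

Both calls pick up the Access Key, Secret Key, region and endpoint from the environment, so nothing sensitive needs to appear in the code.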

Putting it All Together with {targets}

There are a lot of moving parts to this whole process, even more so than displayed here. They could have all been put into a script, but that tends to be fragile, so instead we use {targets}. This package, which is the eventual replacement for {drake}, builds a robust pipeline that schedules jobs to be run based on which jobs depend on others. It is intelligent in that it only reruns jobs that are out of date (meaning their code or upstream data have changed) and can run jobs in parallel. The author, Will Landau, gave an excellent talk about this at the New York Open Statistical Programming Meetup.

In order to use {targets}, we need to have a file named _targets.R in the root of our project. In there we have a list of targets, or jobs, each defined in a tar_target() function (or tar_force(), tar_change() or similar function). The first argument to tar_target() is the name of the target and the next argument is an expression, usually a function call, whose value is saved as the name.

A small example would be a target to get the access token, a target for getting thermostat information (which requires the access token), another to extract the thermostat IDs from that information and a last target to get the report based on the IDs and token.

library(targets)
list(
  tar_target(
    access_token,
    httr::POST(
      url=glue::glue("https://api.ecobee.com/token?grant_type=refresh_token&code={Sys.getenv('ECOBEE_REFRESH_TOKEN')}&client_id={Sys.getenv('ECOBEE_API_KEY')}"), 
      encode='json'
    )
  )
  , tar_target(
    thermostat_info,
    httr::GET(
      'https://api.ecobee.com/1/thermostat?format=json&body={"selection":{"selectionType":"registered","selectionMatch":"","includeLocation":true}}'
      , httr::add_headers(Authorization=sprintf('Bearer %s', access_token))
      , encode='json'
    ) %>% 
      httr::content()
  )
  , tar_target(
    thermostat_ids,
    thermostat_info$thermostatList %>% 
      purrr::map_chr('identifier') %>% 
      paste(collapse=',')
  )
  , tar_target(
    report,
    httr::GET(
        sprintf(
            'https://api.ecobee.com/1/runtimeReport?format=json&body={"startDate":"2021-01-26","endDate":"2021-01-26","startInterval":60,"endInterval":59,"columns":"zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy","selection":{"selectionType":"thermostats","selectionMatch":"%s"},"includeSensors":true}',
            thermostat_ids
        )
        , httr::add_headers(Authorization=sprintf('Bearer %s', access_token))
        , encode='json'
    ) %>% 
        httr::content()
  )
)

With the _targets.R file in the root of the directory, we can visualize the steps with tar_visnetwork().

tar_visnetwork()

To execute the jobs we use tar_make().

tar_make()

The results of individual jobs, known as targets, can be loaded into the working session with tar_load(target_name) or assigned to a variable with variable_name <- tar_read(target_name).
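
For example, after tar_make() finishes, the report target could be brought into an interactive session either way:

```r
library(targets)

# load the target directly into the session under its own name, `report`
tar_load(report)

# or read its value and assign it to whatever name we like
full_report <- tar_read(report)
```

tar_load() is convenient for pulling in several targets at once, while tar_read() is useful when the value should live under a different name.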

Each time this is run the appropriate file in the DigitalOcean space either gets written for the first time, or overwritten.

The actual set of targets for this project were much more involved and can be found on GitHub. Each target is the result of a custom function, which allows for more thorough documentation and testing. The version on GitHub even allows us to run the workflow for different dates so we can backfill data if needed.

Please note that anywhere there is potentially sensitive information such as the refresh token or API key, those were saved in environment variables, then accessed via Sys.getenv(). It is very important not to check the .Renviron files, where environment variables are stored, into git. They should be treated like passwords.

What’s Next?

Now that we have a functioning workflow that we can call anytime, the question becomes how do we run this on a regular basis? We could set up a cron job on a server that requires the server to always be up and we need to maintain this process. We could use scheduled lambda functions, but that’s also a lot of work. Instead, we’ll use GitHub Actions to run this workflow on a scheduled basis. We’ll go over that in the next blog post.


Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, an Adjunct Professor at Columbia University, organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences, and author of R for Everyone.
