In previous posts I talked about collecting temperature data from different rooms in my house and automating that process using Docker and GitHub Actions. Now it is time to analyze that data to figure out what is happening in the house to make informed decisions about replacing the HVAC units and how I can make the house more comfortable for everyone until then.

During this process I pull the data into R from a DigitalOcean space (like an AWS S3 bucket) using {arrow}, manipulate the data with {dplyr}, {tidyr} and {tsibble}, visualize the data with {timetk}, fit models with {fable} and visualize the models with {coefplot}.

Getting the Data

As discussed in the post about tracking the data, we created a CSV for each day of data in a DigitalOcean space. The CSV tracks room temperatures, thermostat settings, outdoor temperature, etc for every five-minute interval during each day.

In the past, if multiple files had to be read into a single data.frame or tibble the best course of action would have been to combine map_df() from {purrr} with read_csv() or fread() from {readr} and {data.table}, respectively. But as seen in Wes McKinney and Neal Richardson’s talk from the 2020 New York R Conference, using open_data() from {arrow} then collecting the data with dplyr::collect() should be faster. An added bonus is that {arrow} has functionality built in to pull from an S3 bucket, and that includes DigitalOcean spaces.

The first step is to use arrow::S3FileSystem$create() to make a reference to the S3 file system. Several pieces of information are needed for this:

access_key: The DigitalOcean Access Key created at https://cloud.digitalocean.com/account/api/tokens.
secret_key: The DigitalOcean Secret Key created at https://cloud.digitalocean.com/account/api/tokens.
scheme: This should just be "https".
endpoint_override: Since we are using DigitalOcean and not AWS we need to change the default to something along the lines of "nyc3.digitaloceanspaces.com", depending on the region.

The Access Key can be retrieved at any time, but the Secret Key is only displayed one time, so we must save it somewhere.

This is all the same information used when writing the files to the DigitalOcean space in the first place. We also save it all in environment variables in the .Renviron file to avoid exposing this sensitive information in our code. It is very important to not check this file into git to reduce risk of exposing the private keys.

space <- arrow::S3FileSystem$create(
  access_key=Sys.getenv('AWS_ACCESS_KEY_ID'),
  secret_key=Sys.getenv('AWS_SECRET_ACCESS_KEY'),
  scheme="https",
  endpoint_override=glue::glue("{Sys.getenv('DO_REGION')}.{Sys.getenv('DO_BASE')}")
)
space

## S3FileSystem

The space object represents the S3 file system we have access to. Its elements are mostly file system type functions such as cd() ls(), CopyFile() and GetFileInfo() and are accessed via space$function_name(). To see a listing of files in a folder inside a bucket we call space$ls() with the bucket and folder name as the argument, in quotes.

space$ls('bucket_name/folder_name')

##  [1] "bucket_name/folder_name/2021-01-01.csv"
##  [2] "bucket_name/folder_name/2021-01-02.csv"
##  [3] "bucket_name/folder_name/2021-01-03.csv"
##  [4] "bucket_name/folder_name/2021-01-04.csv"
##  [5] "bucket_name/folder_name/2021-01-05.csv"
##  [6] "bucket_name/folder_name/2021-01-06.csv"
##  [7] "bucket_name/folder_name/2021-01-07.csv"
##  [8] "bucket_name/folder_name/2021-01-08.csv"
##  [9] "bucket_name/folder_name/2021-01-09.csv"
## [10] "bucket_name/folder_name/2021-01-10.csv"
## [11] "bucket_name/folder_name/2021-01-11.csv"
## [12] "bucket_name/folder_name/2021-01-12.csv"
## [13] "bucket_name/folder_name/2021-01-13.csv"
## [14] "bucket_name/folder_name/2021-01-14.csv"
## [15] "bucket_name/folder_name/2021-01-15.csv"
## [16] "bucket_name/folder_name/2021-01-16.csv"
## [17] "bucket_name/folder_name/2021-01-17.csv"
## [18] "bucket_name/folder_name/2021-01-18.csv"
## [19] "bucket_name/folder_name/2021-01-19.csv"
## [20] "bucket_name/folder_name/2021-01-20.csv"
## [21] "bucket_name/folder_name/2021-01-21.csv"
## [22] "bucket_name/folder_name/2021-01-22.csv"
## [23] "bucket_name/folder_name/2021-01-23.csv"
## [24] "bucket_name/folder_name/2021-01-24.csv"
## [25] "bucket_name/folder_name/2021-01-25.csv"
## [26] "bucket_name/folder_name/2021-01-26.csv"
## [27] "bucket_name/folder_name/2021-01-27.csv"
## [28] "bucket_name/folder_name/2021-01-28.csv"
## [29] "bucket_name/folder_name/2021-01-29.csv"
## [30] "bucket_name/folder_name/2021-01-30.csv"
## [31] "bucket_name/folder_name/2021-01-31.csv"
## [32] "bucket_name/folder_name/2021-02-01.csv"
## [33] "bucket_name/folder_name/2021-02-02.csv"
## [34] "bucket_name/folder_name/2021-02-03.csv"
## [35] "bucket_name/folder_name/2021-02-04.csv"
## [36] "bucket_name/folder_name/2021-02-05.csv"
## [37] "bucket_name/folder_name/2021-02-06.csv"
## [38] "bucket_name/folder_name/2021-02-07.csv"
## [39] "bucket_name/folder_name/2021-02-08.csv"
## [40] "bucket_name/folder_name/2021-02-09.csv"
## [41] "bucket_name/folder_name/2021-02-10.csv"
## [42] "bucket_name/folder_name/2021-02-11.csv"
## [43] "bucket_name/folder_name/2021-02-12.csv"
## [44] "bucket_name/folder_name/2021-02-13.csv"
## [45] "bucket_name/folder_name/2021-02-14.csv"
## [46] "bucket_name/folder_name/2021-02-15.csv"
## [47] "bucket_name/folder_name/2021-02-16.csv"
## [48] "bucket_name/folder_name/2021-02-17.csv"
## [49] "bucket_name/folder_name/2021-02-18.csv"
## [50] "bucket_name/folder_name/2021-02-19.csv"
## [51] "bucket_name/folder_name/2021-02-20.csv"
## [52] "bucket_name/folder_name/2021-02-21.csv"
## [53] "bucket_name/folder_name/2021-02-22.csv"
## [54] "bucket_name/folder_name/2021-02-23.csv"

To use open_dataset() we need a path to the folder holding the files. This is built with space$path(). The structure is "bucket_name/folder_name". For security, even the bucket and folder names are stored in environment variables. For this example, the real values have been substituted with "bucket_name" and "folder_name".

files_path <- space$path(glue::glue("{Sys.getenv('DO_SPACE')}/{Sys.getenv('DO_FOLDER')}"))

files_path

## SubTreeFileSystem: s3://bucket_name/folder_name/

Now we can call open_dataset() to get access to all the files in that folder. We need to specify format="csv" to let the function know the files are CSVs as opposed to parquet or other formats.

files_path <- space$path(glue::glue("{Sys.getenv('DO_SPACE')}/{Sys.getenv('DO_FOLDER')}"))


temps_dataset <- arrow::open_dataset(files_path, format='csv')

## FileSystemDataset with 66 csv files
## Name: string
## Sensor: string
## date: date32[day]
## time: string
## temperature: double
## occupancy: int64
## Thermostat: int64
## zoneAveTemp: double
## HVACmode: string
## fan: int64
## outdoorTemp: double
## outdoorHumidity: int64
## sky: int64
## wind: int64
## zoneClimate: string
## zoneCoolTemp: double
## zoneHeatTemp: double
## zoneHVACmode: string
## zoneOccupancy: int64

Printing out the temps_dataset object shows the column names along with the type of data stored in each column. {arrow} is pretty great and lets us do column selection and row filtering on the data sitting in files which opens up a whole world of data analysis on data too large to fit in memory. We are simply going to select columns of interest and collect all the data into one data.frame (actually a tibble). Notice we call select() before collect() because this reduces the number of columns being transmitted over the network.

library(dplyr)

temps_raw %>% head()

## # A tibble: 6 x 16
##   Name  Sensor date       time  temperature   fan zoneAveTemp HVACmode
##   <chr> <chr>  <date>     <chr>       <dbl> <int>       <dbl> <chr>   
## 1 Upst… Bedro… 2021-01-01 00:0…        67.8   300          70 heat    
## 2 Upst… Bedro… 2021-01-01 00:0…        72.4   300          70 heat    
## 3 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat    
## 4 Upst… Upsta… 2021-01-01 00:0…        63.1   300          70 heat    
## 5 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat    
## 6 Upst… Offic… 2021-01-01 00:0…        NA     300          70 heat    
## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>,
## #   wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>,
## #   zoneHVACmode <chr>

Preparing the Data

The data are in need of some minor attention. First, the time column is read in as a character instead of time. This is a known issue, so instead we combine it with the date column using tidyr::unite() then convert this combination into a datetime (or timestamp) with lubridate::as_datetime(), which requires us to set a time zone. Then, the sensors named Upstairs Thermostat and Downstairs Thermostat actually represent Bedroom 3 and Dining Room, respectively, so we rename those using case_when() from {dplyr}, leaving all the other sensor names as is. Then we only keep rows where the time is earlier than the present. This is due to a quirk of the Ecobee API where it can return future readings, which happens because we request all of the latest day’s data.

temps <- temps_raw %>%
  tidyr::unite(col='time', date, time, sep=' ', remove=TRUE) %>%
  mutate(time=lubridate::as_datetime(time, tz=Sys.timezone())) %>%
  mutate(
    Sensor=case_when(
      Sensor == 'Upstairs Thermostat' ~ 'Bedroom 3',
      Sensor == 'Downstairs Thermostat' ~ 'Dining Room',
      TRUE ~ Sensor
    )
  ) %>%
  filter(time <= lubridate::now())

Skip All of This

Since the actual data may contain sensitive information a sanitized version is stored as parquet files on a public DigitalOcean space. This dataset is not as up to date but it will do for those that want to follow along.

publicspace <- arrow::S3FileSystem$create(
  # anonymous means we do not need to provide credentials
  # is crucial, otherwise the function looks for credentials
  anonymous=TRUE,
  # and crashes R if it can't find them
  scheme="https",
  # the data are stored in the nyc3 region
  endpoint_override='nyc3.digitaloceanspaces.com'
)
publicspace$ls('temperaturedata', recursive=TRUE)

## [1] "temperaturedata/2021"                 
## [2] "temperaturedata/2021/1"               
## [3] "temperaturedata/2021/1/part-0.parquet"
## [4] "temperaturedata/2021/2"               
## [5] "temperaturedata/2021/2/part-1.parquet"

publictemp <- arrow::open_dataset(publicspace$path("temperaturedata")) %>%
    collect()


publictemp %>% head()

## # A tibble: 6 x 15
##   Name  Sensor time                temperature   fan zoneAveTemp HVACmode
##   <chr> <chr>  <dttm>                    <dbl> <int>       <dbl> <chr>   
## 1 Upst… Bedro… 2021-01-01 00:00:00        67.8   300          70 heat    
## 2 Upst… Bedro… 2021-01-01 00:00:00        72.4   300          70 heat    
## 3 Upst… Offic… 2021-01-01 00:00:00        NA     300          70 heat    
## 4 Upst… Bedro… 2021-01-01 00:00:00        63.1   300          70 heat    
## 5 Upst… Offic… 2021-01-01 00:00:00        NA     300          70 heat    
## 6 Upst… Offic… 2021-01-01 00:00:00        NA     300          70 heat    
## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>,
## #   wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>,
## #   zoneHVACmode <chr>

Visualizing with Interactive Plots

The first step in any good analysis is visualizing the data. In the past, I would have recommended {dygraphs} but have since come around to {feasts} from Rob Hyndman’s team and {timetk} from Matt Dancho due to their ease of use, better flexibility with data types and because I personally think they are more visually pleasing.

In order to show the outdoor temperature on the same footing as the sensor temperatures, we select the pertinent columns with the outdoor data from the temps tibble and stack them at the bottom of temps, after accounting for duplicates (Each sensor has an outdoor reading, which is the same across sensors).

all_temps <- dplyr::bind_rows(
  temps,
  temps %>% 
    dplyr::select(Name, Sensor, time, temperature=outdoorTemp) %>% 
    dplyr::mutate(Sensor='Outdoors', Name='Outdoors') %>% 
    dplyr::distinct()
)

Then we use plot_time_series() from {timetk} to make a plot that has a different colored line for each sensor and for the outdoors (with sensors being grouped into facets according to which thermostat they are associated with). Ordinarily the interactive version would be better (.interactive=TRUE below), but for the sake of this post static displays better.

library(timetk)
all_temps %>%
  group_by(Name) %>%
  plot_time_series(
    .date_var=time, 
    .value=temperature, .color_var=Sensor, .smooth=FALSE,
    .interactive=FALSE
  )

## Warning: Removed 34776 row(s) containing missing values (geom_path).

From this we can see a few interesting patterns. Bedroom 1 is consistently higher than the other bedrooms, but lower than the offices which are all on the same thermostat. All the downstairs rooms see their temperatures drop overnight when the thermostat is lowered to conserve energy. Between February 12 and 18 this dip was not as dramatic, due to running an experiment to see how raising the temperature on the downstairs thermostat would affect rooms on the second floor.

Computing Cross Correlations

One room in particular, Bedroom 2, always seemed colder than all the rest of the rooms, so I wanted to figure out why. Was it affected by physically adjacent rooms? Or maybe it was controlled by the downstairs thermostat rather than the upstairs thermostat as presumed.

So the first step was to see which rooms were correlated with each other. But computing correlation with time series data isn’t as simple because we need to account for lagged correlations. This is done with the cross-correlation function (ccf).

In order to compute the ccf, we need to get the data into a time series format. While R has extensive formats, most notably the built-in ts and its extension xts by Jeff Ryan, we are going to use tsibble, from Rob Hyndman and team, which are tibbles with a time component.

The tsibble object can treat multiple time series as individual data and this requires the data to be in long format. For this analysis, we want to treat multiple time series as interconnected, so we need to put the data into wide format using pivot_wider().

wide_temps <- all_temps %>% 
  # we only need these columns
  select(Sensor, time, temperature) %>% 
  # make each time series its own column
  tidyr::pivot_wider(
    id_cols=time, 
    names_from=Sensor, 
    values_from=temperature
  ) %>% 
  # convert into a tsibble
  tsibble::as_tsibble(index=time) %>% 
  # fill down any NAs
  tidyr::fill()


wide_temps

## # A tibble: 15,840 x 12
##    time                `Bedroom 2` `Bedroom 1` `Office 1` `Bedroom 3` `Office 3`
##    <dttm>                    <dbl>       <dbl>      <dbl>       <dbl>      <dbl>
##  1 2021-01-01 00:00:00        67.8        72.4         NA        63.1         NA
##  2 2021-01-01 00:05:00        67.8        72.8         NA        62.7         NA
##  3 2021-01-01 00:10:00        67.9        73.3         NA        62.2         NA
##  4 2021-01-01 00:15:00        67.8        73.6         NA        62.2         NA
##  5 2021-01-01 00:20:00        67.7        73.2         NA        62.3         NA
##  6 2021-01-01 00:25:00        67.6        72.7         NA        62.5         NA
##  7 2021-01-01 00:30:00        67.6        72.3         NA        62.7         NA
##  8 2021-01-01 00:35:00        67.6        72.5         NA        62.5         NA
##  9 2021-01-01 00:40:00        67.6        72.7         NA        62.4         NA
## 10 2021-01-01 00:45:00        67.6        72.7         NA        61.9         NA
## # … with 15,830 more rows, and 6 more variables: `Office 2` <dbl>, `Living
## #   Room` <dbl>, Playroom <dbl>, Kitchen <dbl>, `Dining Room` <dbl>,
## #   Outdoors <dbl>

Some rooms’ sensors were added after the others so there are a number of NAs at the beginning of the tracked time.

We compute the ccf using CCF() from the {feasts} package then generate a plot by piping that result into autoplot(). Besides a tsibble, CCF() needs two arguments, the columns whose cross-correlation is being computed.

library(feasts)
wide_temps %>%
  CCF(`Living Room`, `Bedroom 2`) %>%
  autoplot()

Cross-correlation between Living Room and Bedroom 2

The negative lags are when the first column is a leading indicator of the second column and the positive lags are when the first column is a lagging indicator of the second column. In this case, Living Room seems to follow Bedroom 2 rather than lead.

We want to compute the ccf between Bedroom 2 and every other room. We call CCF() on each column in wide_temps against Bedroom 2, including Bedroom 2 itself, by iterating over the columns of wide_temps with map(). We use a ~ to treat CCF() as a quoted function and wrap .x in double curly braces to take advantage of non-standard evaluation. This creates a list of objects that can be plotted with autoplot(). We make a list of ggplot objects by iterating over the ccf objects using imap(). This allows us to use the name of each element of the list in ggtitle. Lastly, wrap_plots() from {patchwork} is used to show all the plots in one display.

library(patchwork)
library(ggplot2)


room_ccf <- wide_temps %>% 
  purrr::map(~CCF(wide_temps, {{.x}}, `Bedroom 2`))


# we don't want the CCF with time, outdoor temp or the Bedroom 2 itself
room_ccf$time <- NULL
room_ccf$Outdoors <- NULL
room_ccf$`Bedroom 2` <- NULL


biggest_cor <- room_ccf %>% purrr::map_dbl(~max(.x$ccf)) %>% max()


room_ccf_plots <- purrr::imap(
  room_ccf, 
  ~ autoplot(.x) +
    # show the maximally cross-correlation value
    # so it easier to compare rooms
    geom_hline(yintercept=biggest_cor, color='red', linetype=2) + 
    ggtitle(.y) + 
    labs(x=NULL, y=NULL) + 
    ggthemes::theme_few()
)


wrap_plots(room_ccf_plots, ncol=3)

Cross-correlation between each room and Bedroom 2

This brings up an odd result. Office 1, Office 2, Office 3, Bedroom 1, Bedroom 2 and Bedroom 3 are all controlled by the same thermostat, but Bedroom 2 is only really cross-correlated with Bedroom 3 (and negatively with Office 3). Instead, Bedroom 2 is most correlated with Playroom and Living Room, both of which are controlled by a different thermostat.

So it will be interesting to see which thermostat setting (the set desired temperature) is most cross-correlated with Bedroom 2. At the same time we’ll check how much the outdoor temperature cross-correlates with Bedroom 2 as well.

temp_settings <- all_temps %>% 
  # there is a column for outdoor temperatures so we don't need these rows
  filter(Name != 'Outdoors') %>% 
  select(Name, time, HVACmode, zoneCoolTemp, zoneHeatTemp) %>% 
  # this information is repeated for each sensor, so we only need it once
  distinct() %>% 
  # depending on the setting, the set temperature is in two columns
  mutate(setTemp=if_else(HVACmode == 'heat', zoneHeatTemp, zoneCoolTemp)) %>% 
  tidyr::pivot_wider(id_cols=time, names_from=Name, values_from=setTemp) %>% 
  # get the reading data
  right_join(wide_temps %>% select(time, `Bedroom 2`, Outdoors), by='time') %>% 
  as_tsibble(index=time)


temp_settings

## # A tibble: 15,840 x 5
##    time                Upstairs Downstairs `Bedroom 2` Outdoors
##    <dttm>                 <dbl>      <dbl>       <dbl>    <dbl>
##  1 2021-01-01 00:00:00       71         72        67.8     31.3
##  2 2021-01-01 00:05:00       71         72        67.8     31.3
##  3 2021-01-01 00:10:00       71         72        67.9     31.3
##  4 2021-01-01 00:15:00       71         66        67.8     31.3
##  5 2021-01-01 00:20:00       71         66        67.7     31.3
##  6 2021-01-01 00:25:00       71         66        67.6     31.3
##  7 2021-01-01 00:30:00       71         66        67.6     31.2
##  8 2021-01-01 00:35:00       71         66        67.6     31.2
##  9 2021-01-01 00:40:00       71         66        67.6     31.2
## 10 2021-01-01 00:45:00       71         66        67.6     31.2
## # … with 15,830 more rows

Making this tsibble longer allows us to use the built-in grouping and iteration that comes with CCF() and other functions in {fable}.

temp_settings_long <- temp_settings %>% 
  tidyr::pivot_longer(
    cols=-c(time, `Bedroom 2`), 
    names_to='Thermostat', values_to='Temperature'
  )

room_thermostat_ccf <- temp_settings_long %>% 
  CCF(Temperature, `Bedroom 2`)

room_thermostat_ccf %>% autoplot() + ggthemes::theme_few()

Cross-correlation between Bedroom 2 and thermostat settings and the outdoor temperature

This plot makes a strong argument for Bedroom 2 being more affected by the downstairs thermostat than the upstairs like we presume. But I don’t think the downstairs thermostat is actually controlling the vents in Bedroom 2 because I have looked at the duct work and followed the path (along with a professional HVAC technician) from the furnace to the vent in Bedroom 2. What I think is more likely is that the rooms downstairs get so cold (I lower the temperature overnight to conserve energy), and there is not much insulation between the floors, so the vent in Bedroom 2 can’t pump enough warm air to compensate.

I did try to experiment for a few nights (February 12 and 18) by not lowering the downstairs temperature but the eyeball test didn’t reveal any impact. Perhaps a proper A/B test is called for.

Fitting Time Series Models with `{fable}`

Going beyond cross-correlations, an ARIMA model with exogenous variables can give an indication if input variables have a significant impact on a time series while accounting for autocorrelation and seasonality. We want to model the temperature in Bedroom 2 while using all the other rooms, the outside temperature and the thermostat settings to see which affected Bedroom 2 the most. We fit a number of different models then compare them using AICc (AIC corrected) to see which fits the best.

To do this we need a tsibble with each of these variables as a column. This requires us to join temp_settings with wide_temps.

all_ts <- wide_temps %>% 
  left_join(
    temp_settings %>% select(time, Upstairs, Downstairs),
    by='time'
  )

Now we use model() from {fable} (technically from {fabletools}) to fit a number of models. We make it run in parallel with plan() from {future}.

library(future)
plan(multisession, workers=10)

library(tictoc)

library(fable)

tic()
ts_mods <- all_ts %>% 
  model(
    mod0=ARIMA(`Bedroom 2`)
    , mod1=ARIMA(`Bedroom 2` ~ `Bedroom 1`)
    , mod2=ARIMA(`Bedroom 2` ~ `Bedroom 3`)
    , mod3=ARIMA(`Bedroom 2` ~ `Bedroom 1` + `Bedroom 3`)
    , mod4=ARIMA(`Bedroom 2` ~ `Living Room`)
    , mod5=ARIMA(`Bedroom 2` ~ Playroom)
    , mod6=ARIMA(`Bedroom 2` ~ `Living Room` + Playroom)
    , mod7=ARIMA(`Bedroom 2` ~ `Living Room` + Playroom + 
                   Kitchen + `Dining Room`
    )
    , mod8=ARIMA(`Bedroom 2` ~ `Bedroom 1` + `Bedroom 3` + 
                   `Living Room` + Playroom + 
                   Kitchen + `Dining Room`
    )
    , mod9=ARIMA(`Bedroom 2` ~ Outdoors)
    , mod10=ARIMA(`Bedroom 2` ~ Upstairs)
    , mod11=ARIMA(`Bedroom 2` ~ Downstairs)
    , mod12=ARIMA(`Bedroom 2` ~ Downstairs + Upstairs)
    , mod13=ARIMA(`Bedroom 2` ~ Outdoors + Downstairs + Upstairs)
    , mod14=ARIMA(`Bedroom 2` ~ Outdoors + 
                    Downstairs + Upstairs + 
                    `Living Room` + Playroom + 
                    Kitchen + `Dining Room`
    )
    , mod15=ARIMA(`Bedroom 2` ~ Outdoors + 
                    Downstairs + Upstairs + 
                    `Bedroom 1` + `Bedroom 3`
    )
    , mod16=ARIMA(`Bedroom 2` ~ Outdoors + 
                    Downstairs + Upstairs + 
                    `Bedroom 1` + `Bedroom 3` +
                    `Living Room` + Playroom + 
                    Kitchen + `Dining Room`
    )
  )
toc()

## 99.41 sec elapsed

This results in a tibble of models (termed a mable) with one column per model. We use glance() to see the AICc for each, along with other information.

ts_mods %>% 
  glance() %>% 
  arrange(AICc)

## # A tibble: 17 x 8
##    .model sigma2 log_lik     AIC    AICc     BIC ar_roots  ma_roots  
##    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <list>    <list>    
##  1 mod3   0.0247   6723. -13427. -13427. -13358. <cpl [3]> <cpl [14]>
##  2 mod1   0.0247   6714. -13411. -13411. -13350. <cpl [3]> <cpl [14]>
##  3 mod15  0.0247   6709. -13394. -13394. -13302. <cpl [3]> <cpl [3]> 
##  4 mod4   0.0257   6393. -12768. -12768. -12699. <cpl [3]> <cpl [25]>
##  5 mod2   0.0264   6199. -12381. -12381. -12320. <cpl [2]> <cpl [15]>
##  6 mod9   0.0264   6190. -12365. -12365. -12304. <cpl [2]> <cpl [26]>
##  7 mod10  0.0264   6190. -12365. -12365. -12303. <cpl [2]> <cpl [26]>
##  8 mod0   0.0264   6189. -12365. -12365. -12311. <cpl [2]> <cpl [26]>
##  9 mod13  0.0264   6176. -12331. -12331. -12255. <cpl [2]> <cpl [26]>
## 10 mod12  0.0264   6175. -12331. -12331. -12262. <cpl [2]> <cpl [26]>
## 11 mod11  0.0264   6174. -12331. -12331. -12270. <cpl [2]> <cpl [26]>
## 12 mod6   0.0195   4940.  -9863.  -9863.  -9802. <cpl [5]> <cpl [0]> 
## 13 mod5   0.0195   4932.  -9849.  -9849.  -9796. <cpl [5]> <cpl [0]> 
## 14 mod8   0.0163   4603.  -9184.  -9184.  -9100. <cpl [4]> <cpl [0]> 
## 15 mod16  0.0163   4605.  -9181.  -9181.  -9074. <cpl [4]> <cpl [0]> 
## 16 mod7   0.0171   4364.  -8708.  -8708.  -8631. <cpl [5]> <cpl [0]> 
## 17 mod14  0.0171   4365.  -8704.  -8704.  -8605. <cpl [5]> <cpl [0]>

ts_mods %>% 
  glance() %>% 
  # sort .model according to AICc for better plotting
  mutate(.model=forcats::fct_reorder(.model, .x=AICc, .fun=sort)) %>% 
  # smaller is better for AICc
  # so negate it so that the tallest bar is best
  ggplot(aes(x=.model, y=-AICc)) + 
  geom_col() + 
  theme(axis.text.x=element_text(angle=25))

The best two models (as judged by AICc) are mod3 and mod1, some of the simpler models. The key thing they have in common is that Bedroom 1 is a predictor, which somewhat contradicts the findings from the cross-correlation function. We can examine mod3 by selecting it out of ts_mods then calling report().

ts_mods %>% 
  select(mod3) %>% 
  report()

## Series: Bedroom 2 
## Model: LM w/ ARIMA(3,1,2)(0,0,1)[12] errors 
## 
## Coefficients:
##          ar1      ar2     ar3     ma1     ma2     sma1  `Bedroom 1`
##       0.4497  -0.5184  0.0446  0.0650  0.3160  -0.0049       0.2723
## s.e.  0.0593   0.0471  0.0265  0.0589  0.0274   0.0082       0.0082
##       `Bedroom 3`
##            0.0396
## s.e.       0.0092
## 
## sigma^2 estimated as 0.02469:  log likelihood=6722.64
## AIC=-13427.28   AICc=-13427.27   BIC=-13358.25

This shows that Bedroom 1 and Bedroom 3 are significant predictors with SARIMA(3,1,2)(0,0,1) errors.

The third best model is one of the more complicated models, mod15. Not all the predictors are significant, though we should not judge a model, or the predictors, based on that.

ts_mods %>% 
  select(mod15) %>% 
  report()

## Series: Bedroom 2 
## Model: LM w/ ARIMA(3,1,3) errors 
## 
## Coefficients:
##          ar1      ar2      ar3     ma1     ma2     ma3  Outdoors  Downstairs
##       0.1870  -0.4426  -0.0724  0.3288  0.3745  0.0886   -0.0037     -0.0015
## s.e.  0.2351   0.0928   0.1058  0.2352  0.0490  0.0692    0.0026      0.0025
##       Upstairs  `Bedroom 1`  `Bedroom 3`
##         0.0087       0.2718       0.0387
## s.e.    0.0066       0.0081       0.0092
## 
## sigma^2 estimated as 0.02468:  log likelihood=6709.22
## AIC=-13394.43   AICc=-13394.41   BIC=-13302.39

We can visualize this model with {coefplot} after stripping out the time series coefficients.

library(coefplot)
ts_mods %>% 
    select(mod15) %>% 
    coef() %>% 
    filter(!stringr::str_detect(term, '^ar|ma\\d+$')) %>%
    select(Coefficient=term, Value=estimate, SE=std.error, Model=.model) %>%
    mutate(
        LowOuter=Value - 2*SE, LowInner=Value - 1*SE,
        HighInner=Value + 1*SE, HighOuter=Value + 2*SE
    ) %>%
    arrange(Value) %>%
    mutate(Coefficient=factor(Coefficient, levels=Coefficient)) %>%
    coefplot()

What all these top performing models have in common is the inclusion of the other bedrooms as predictors.

While helping me interpret the data, my wife pointed out that all the colder rooms are on one side of the house while the warmer rooms are on the other side. That colder side is the one that gets hit by winds which are rather strong. This made me remember a long conversation I had with a very friendly and knowledgeable sales person at GasTec who said that wind can significantly impact a house’s heat due to it blowing away the thermal envelope. I wanted to test this idea, but while the Ecobee API returns data on wind, the value is always 0, so it doesn’t tell me anything. Hopefully, I can get that fixed.

Running Everything with `{targets}`

There were many moving parts to the analysis, so I managed it all with {targets}. Each step seen in this post is a separate target in the workflow. There is even a target to build an html report from an R Markdown file. The key is to load objects created by the workflow into the R Markdown document with tar_read() or tar_load(). Then tar_render() will figure out how the R Markdown document figures into the computational graph.

tar_render(
  results,
  'Analysis.Rmd'
)

I went over {targets} more thoroughly in the previous post and I recommend Will Landau’s talk at the New York Open Statistical Programming Meetup to learn more.

Reporting with RStudio Connect

In addition to the R Markdown I made a shiny dashboard using the R Markdown flexdashboard format. I host this on RStudio Connect so I can get a live view of the data and examine the models. Besides from being a partner with RStudio, I really love this product and I use it for all of my internal reporting (and side projects) and we have a whole practice around getting companies up to speed with the tool.

What Comes Next?

So what can be done with this information? Based on the ARIMA model the other bedrooms on the same floor are predictive of Bedroom 2, which makes sense since they are controlled by the same thermostat. Bedroom 3 is consistently one of the coldest in the house and shares a wall with Bedroom 2, so maybe that wall could use some insulation. Bedroom 1 on the other hand is the hottest room on that floor, so my idea is to find a way to cool off that room (we sealed the vents with magnetic covers), thus lowering the average temperature on the floor, causing the heat to kick in, raising the temperature everywhere, including Bedroom 2.

The cross-correlation functions also suggest a relationship with Playroom and Living Room, which are some of the colder rooms downstairs and the former is directly underneath Bedroom 2. I’ve already put caulking cord in the gaps around the floorboards in the Playroom and that has noticeably raised the temperature in there, so hopefully that will impact Bedroom 2. We already have storm windows on just about every window and have added thermal curtains to Bedroom 2, Bedroom 3 and the Living Room. My wife and I also discovered air billowing in from around, not in, the fireplace in the Living Room, so if we can close that source of cold air, that room should get warmer then hopefully Bedroom 2 as well. Lastly, we want to insulate between the first floor ceiling and second floor.

While we make these fixes, we’ll keep collecting data and trying new ways to analyze it all. It will be interesting to see how the data change in the summer.

These three posts were intended to show the entire process from data collection to automation (what some people are calling robots) to analysis, and show how data can be used to make actionable decisions. This is one more way I’m improving my life with data.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

In a recent post I talked about collecting temperature data from different rooms in my house. Using {targets}, I am able to get temperature readings for any given day down to five-minute increments and store that data on a DigitalOcean space.

Since I want to be able to both fit models on the data and track temperatures close to real time I needed this to run on a regular basis, say every 30-60 minutes. I have a machine learning server that Kaz Sakamoto, Michael Beigelmacher and I built from parts we bought and a collection of Raspberry Pis, but I didn’t want to maintain the hardware for this, I wanted something that was handled for me.

Luckily GitHub Actions can run arbitrary code on a schedule and provides free compute for public repositories. So I built a Docker image and GitHub Actions configuration files to make it all happen. This all is based on the templogging repository on GitHub.

Docker Image

To ensure everything works as expected and to prevent package updates from breaking code, I used a combination of {renv} and Docker.

{renv} allows each individual project to have its own set of packages isolated from all other projects’ packages. It also keeps track of package versions so that the project can be restored on other computers, or in a Docker container, in the exact same configuration.

Using {renv} is fairly automatic. When starting a new project run renv::init() to create the isolated package library and start tracking packages. Then packages are installed as usual with install.packages() or renv::install(). Then renv::snapshot() is periodically used to write the specific package versions to the renv.lock file which is saved in the root directory of the project.

Docker is like a lightweight virtual machine that can create a specific environment. For this project I wanted a specific version of R (4.0.3), a handful of system libraries like libxml2-dev (for {igraph}, which is needed for {targets}) and libcurl4-openssl-dev (for {httr}) and all of the packages tracked with {renv}.

The R version is provided by starting with the corresponding r-ver image from the rocker project. The system libraries are installed using apt install.

The R packages are installed using renv::restore() but this requires files from the project to be added to the image in the right order. First, the templogging directory is created to hold everything, then individual files needed for the project and {renv} are added to that directory: templogging.Rproj, .Rprofile and renv.lock. Then the entire renv directory is added as well. After that, the working directory is changed to the templogging directory so that the following commands take place in that location.

Then comes time to install all the packages tracked in renv.lock. This is done with RUN Rscript -e "renv::restore(repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest', confirm=FALSE)". Rscript -e runs R code as if it were executed in an R session. renv::restore() installs all the specific packages. Setting repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest' specifies to use prebuilt binaries from the public RStudio Package Manager (some people need the commercial, private version and if you think this may be you, my company, Lander Analytics can help you get started).

After the packages are installed, the rest of the files in the project are copied into the image with ADD . /templogging. It is important that this step takes place after the packages are installed so that changes to code does not trigger a reinstall of the packages.

This is the complete Dockerfile.

FROM rocker/r-ver:4.0.3

# system libraries
RUN apt update && \
    apt install  -y --no-install-recommends \
    # for igraph
    libxml2-dev \
    libglpk-dev \
    libgmp3-dev \
    # for httr
    libcurl4-openssl-dev \
    libssl-dev && \
    # makes the image smaller
    rm -rf /var/lib/apt/lists/*

# create project directory
RUN mkdir templogging
# add some specific files
ADD ["templogging.Rproj", ".Rprofile", "renv.lock", "/templogging/"]
# add all of the renv folder, except the library which is marked out in .dockerignore
ADD renv/ /templogging/renv

# make sure we are in the project directory so restoring the packages works cleanly
WORKDIR /templogging

# this restores the desired packages (including specific versions)
# it uses the public RStudio package manager to get binary versions of the packages
# this is for faster installation
RUN Rscript -e "renv::restore(repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest', confirm=FALSE)"

# then we add in the rest of the project folder, including  all the code
# we do this separately so that we can change code without having to reinstall all the packages
ADD . /templogging

To make building and running the image easier I made a docker-compose.yml file.

version: '3.4'

services:
  templogging:
    image: jaredlander/templogging
    container_name: tempcheck
    build:
      context: .
      dockerfile: ./Dockerfile
    environment:
      - BUCKET_NAME=${BUCKET_NAME}
      - FOLDER_NAME=${FOLDER_NAME}
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION}
      - AWS_S3_ENDPOINT=${AWS_S3_ENDPOINT}
      - AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN}
      - TZ=${TZ}

So we can now build and upload the image with the following shell code.

docker-compose build
docker push jaredlander/templogging:latest

Now the image is available to be used anywhere. Since this image, and all the code are public, it’s important that nothing personal or private was saved in the code. As can be seen in the docker-compose.yml file or any of the R code in the repo, all potentially sensitive information was stored in environment variables.

The {targets} workflow can be executed inside the Docker container with the following shell command.

docker-compose run templogging R -e "targets::tar_make()"

That will get and write data for the current date. To run this for a specific date, a variable called process_date should be set to that date (assumes the TZ environment variable has been set) and tar_make() should be called with the callr_function argument set to NULL. I’m sure there is a better way to set arguments to make tar_make() of runtime settings, but I haven’t figured it out yet.

docker-compose run templogging R -e "process_date <- as.Date('2021-02-20')" -e "targets::tar_make(callr_function=NULL)"

GitHub Actions

After the Docker image was built, I automated the whole process using GitHub Actions. This requires a yml file inside the .github/workflows folder in the repository.

The first section is on:. This tells GitHub to run this job whenever there is a push to the main or master branch and whenever there is a pull request to these branches. It also runs every 30 minutes thanks to the schedule block that has - cron: '*/30 * * * *'. The */30 means “every thirty” and it’s the first position which means minutes (0-59). The second position would be hours (0-23). The following positions are day of month (1-31), month of year (1-12), day of the week (1-7, with 1 as Monday) and year (1900-3000).

The next section, name:, just provides the name of the workflow.

Then comes the jobs: section. Multiple jobs can be run, but there is only one job called Run-Temp-Logging.

The job runs on (runs-on:) ubuntu-latest as opposed to Windows or macOS.

This is going to use the Docker container so the container: block specifies an image: and environment variables (env:). The environment variables are stored securely in GitHub and are accessed via ${{ secrets.VARIABLE_NAME }}. Since I created this I have been debating if it would be faster to install the packages directly in the virtual machine spun up by the actions runner than to use a Docker image. Given that R package installations can be cached it might work out, but I haven’t tested it yet.

The steps: block runs the {targets} workflow. There can be multiple steps, though in this case there is only one, so each starts with a dash (-). For the step, a name: is given, along with what to run (run: R -e "targets::tar_make()") and the working directory (working-directory: /templogging), each on a separate line in the file. The R -e "targets::tar_make()" command runs in the Docker container with /templogging as the working directory. This has the same result as running docker-compose run templogging R -e "targets::tar_make()" locally. Each time this is run the file in the DigitalOcean space gets written anew.

The entire yml file is below.

on:
    push:
        branches:
            - main
            - master
    pull_request:
        branches:
            - main
            - master
    schedule:
        - cron: '*/30 * * * *'

name: Run-Targets

jobs:
    Run-Temp-Logging:
        runs-on: ubuntu-latest
        container:
            image: jaredlander/templogging:latest
            env:
                BUCKET_NAME: ${{ secrets.BUCKET_NAME }}
                FOLDER_NAME: ${{ secrets.FOLDER_NAME }}
                AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
                AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
                AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
                AWS_S3_ENDPOINT: ${{ secrets.AWS_S3_ENDPOINT }}
                AWS_SESSION_TOKEN: ""
                ECOBEE_API_KEY: ${{ secrets.ECOBEE_API_KEY }}
                ECOBEE_REFRESH_TOKEN: ${{ secrets.ECOBEE_REFRESH_TOKEN }}
        steps:
            - name: Run targets workflow
              run: R -e "targets::tar_make()"
              working-directory: /templogging

With this running every 30 minutes I can pull the data just about any time and have close-enough-to-live insight into my house. But let’s be honest, I really only need to see up to the past few days to make sure everything is all right, so a 30-minute delay is more than good enough.

End of Day

While the job is scheduled to run every thirty minutes, I noticed that it wasn’t exact, so I couldn’t be sure I would capture the last few five-minute intervals of each day. So I made another GitHub Action to pull the prior day’s data at 1:10 AM.

The instructions are mostly the same with just two changes. First, the schedule is - cron: '10 1 * * *' which means on the 10th minute of the first hour. Since the rest of the slots are * the job occurs on every one of them, so it is the 10th minute of the first hour of every day of every month of every year.

The second change is the command given to the Docker container to run the {targets} workflow. As alluded to earlier, the _targets.R file is designed so that if Process_date is assigned a date, the data will be pulled for that date instead of the current date. So the run: portion has the command R -e "Sys.setenv(TZ='${{ secrets.TZ }}')" -e "process_date <- Sys.Date() - 1" -e "targets::tar_make(callr_function=NULL)". This makes sure the timezone environment variable is set in R, process_date gets the value of the prior day’s date and tar_make() is called.

The entire yml file is below.

on:
    push:
        branches:
            - main
            - master
    pull_request:
        branches:
            - main
            - master
    schedule:
        - cron: '10 1 * * *'

name: Run-Targets

jobs:
    Run-Temp-Logging:
        runs-on: ubuntu-latest
        container:
            image: jaredlander/templogging:latest
            env:
                BUCKET_NAME: ${{ secrets.BUCKET_NAME }}
                FOLDER_NAME: ${{ secrets.FOLDER_NAME }}
                AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
                AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
                AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
                AWS_S3_ENDPOINT: ${{ secrets.AWS_S3_ENDPOINT }}
                AWS_SESSION_TOKEN: ""
                ECOBEE_API_KEY: ${{ secrets.ECOBEE_API_KEY }}
                ECOBEE_REFRESH_TOKEN: ${{ secrets.ECOBEE_REFRESH_TOKEN }}
        steps:
            - name: Run targets workflow
              run: R -e "Sys.setenv(TZ='${{ secrets.TZ }}')" -e "process_date <- Sys.Date() - 1" -e "targets::tar_make(callr_function=NULL)"
              working-directory: /templogging

What’s Next?

Now that the data are being collected it’s time to analyze and see what is happening in the house. This is very involved so that will be the subject of the next post.

My R-using family have a very old house with uneven heating and cooling. We are going to replace the HVAC system soon and rather than just install a new system I wanted to make data-driven decisions about the replacement. Since I have a couple of ecobee Thermostats which have remote sensors, I figured there must be an API I can use to track the temperature in various rooms of the house.

A quick search turned up an API that was even better than I hoped. My initial plan was to call the API every five minutes and write the data to a database, either self hosted or managed by DigitalOcean. This would require my code to never miss a run because it would only capture the status at that moment, which also meant not being able to go back in time to correct things. Fortunately, the API can also give historic data for any period of time in five-minute increments and this is much more useful.

In this post I will go through the process of calling the API, storing the data and building a workflow using {targets}. In the next post I’ll cover automating the process using GitHub Actions. In the third post I’ll show how I analyzed the data using time series methods.

Accessing the API

Before we can use the API we need to sign up for a key, and have an ecobee thermostat, of course. Getting the API key is explained here, though last I checked we cannot complete the process with two factor authentication (2FA) turned on, so we should disable 2FA, register for the API, then re-enable 2FA.

Using the API requires the API key and an access token, but the access token only lasts for one hour. So we are given a refresh token, which is good for a year, and can be used to get a new access token. There is probably a better way to build the URL for {httr}, but this worked. Both the refresh token and API key will be strings of letters, numbers and symbols along the lines of ha786234h1q763h.

token_request <- httr::POST(
  # it uses the token endpoint with parameters in the URL
  url="https://api.ecobee.com/token?grant_type=refresh_token&code=ha786234h1q763h&client_id=kjdf837hw7384", 
  encode='json'
)

access_token <- httr::content(token_request)$access_token

Next, we use the API to get information about our thermostats, particularly the IDs and the current times. The access token is used for bearer authentication, which is added as a header to the httr::get() call. All API requests from now on will need this access token.

thermostat_info <- httr::GET(
  # the requesting URL
  # the request is to the thermostat endpoint with parameters passes in a json body that is part of the URL
  # location is needed for the timezone offset
  'https://api.ecobee.com/1/thermostat?format=json&body={"selection":{"selectionType":"registered","selectionMatch":"","includeLocation":true}}'
  # authentication header
  # supplying a header with "Bearer access_token" tells the API who we are
  , httr::add_headers(Authorization=glue::glue('Bearer {access_token}'))
  # json is the chosen format
  , encode='json'
) %>% 
  # extract the contact into a list
  httr::content()

From this object we can get the thermostat IDs which will be useful for generating the report.

thermostat_ids <- thermostat_info$thermostatList %>% 
  purrr::map_chr('identifier') %>% 
  # make it a single-element character vector that separates IDs with a comma
  paste(collapse=',')

thermostat_ids

## [1] "28716,17611"

Given thermostat IDs we can request a report for a given time period. The report requires a startDate and endDate. By default the returned data are from midnight to 11:55 PM UTC. So a startInterval and endInterval shifts the time to the appropriate time zone which can be ascertained from thermostat_info$thermostatList[[1]]$location$timeZoneOffsetMinutes.

First thing to note is that the intervals are in groups of five minutes, so in a 24-hour day there are 288 intervals. We then account for being east or west of UTC, and subtract out a fifth of the offset. The endInterval is just one less than this startInterval in order to completely wrap around the clock. Doing these calculations should get us the entire day in our time zone.

timeOffset <- thermostat_info$thermostatList[[1]]$location$timeZoneOffsetMinutes
start_interval <- 287*(timeOffset > 0) - timeOffset/5
end_interval <- start_interval - 1

Now we build the request URL for the report. The body in the URL takes a number of parameters. We already have startDate, endDate, startInterval and endInterval. columns is a comma separated listing of the desired data points. For our purposes we want "zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy". The selection parameter takes two arguments: selectionType, which should be "thermostats" and selectionMatch, which is the comma-separated listing of thermostat IDs saved in thermostat_ids. We want data from the room sensors so we set includeSensors to true. Like in previous requests, this URL is built using glue::glue().

report <- httr::GET(
  glue::glue(
    # the request is to the runtimeReport endpoint with parameters passes in a json body that is part of the URL
    'https://api.ecobee.com/1/runtimeReport?format=json&body={{"startDate":"{startDate}","endDate":"{endDate}","startInterval":{startInterval},"endInterval":{endInterval},"columns":"zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy","selection":{{"selectionType":"thermostats","selectionMatch":"{thermostats}"}},"includeSensors":true}}'
  )
  # authentication
  , httr::add_headers(Authorization=glue::glue('Bearer {access_token}'))
  , encode='json'
) %>% 
  httr::content()

Handling the Data

Now that we have the report object we need to turn it into a nice data.frame or tibble. There are two major components to the data: thermostat information and sensor information. Multiple sensors are associated with a thermostat, including a sensor in the thermostat itself. So if our house has two HVAC zones, and hence two thermostats, we would have two sets of this information. The thermostat information includes overall data such as date, time, average temperature (average reading for all the sensors associated with a thermostat), HVACMode (heat or cool), fan speed and outdoor temperature. The sensor information has readings from each sensor associated with each thermostat such as detected occupancy and temperature.

The thermostat level and sensor level data are kept in lists inside the report object and need to be handled separately.

Thermostat Level Data

The report object has an element called reportList, which is a list where each element represents a different thermostat. For a house with two thermostats this list will have a length of two. Each of these elements contains a list called rowList. This has as many elements as the number of intervals requested, 288 for a full day (five-minute intervals for 24 hours). All of these elements are character vectors of length one, with commas separating the values. A few examples are below.

report$reportList[[1]]$rowList

## [[1]]
## [1] "2021-01-25,00:00:00,70.5,heat,75,28.4,43,5,0,Sleep,71,71,heatOff,0"
## 
## [[2]]
## [1] "2021-01-25,00:05:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## 
## [[3]]
## [1] "2021-01-25,00:10:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## 
## [[4]]
## [1] "2021-01-25,00:15:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## 
## [[5]]
## [1] "2021-01-25,00:20:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"

We can combine all of these into a single vector with unlist().

unlist(report$reportList[[1]]$rowList)

## [1] "2021-01-25,00:00:00,70.5,heat,75,28.4,43,5,0,Sleep,71,71,heatOff,0"      
## [2] "2021-01-25,00:05:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [3] "2021-01-25,00:10:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [4] "2021-01-25,00:15:00,70.2,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"
## [5] "2021-01-25,00:20:00,70.3,heat,300,28.4,43,5,0,Sleep,71,71,heatStage1On,0"

The column names are stored in the columns element of the report object.

report$columns

## [1] "zoneAveTemp,HVACmode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHVACmode,zoneOccupancy"

This can be split into a vector of names using strsplit().

strsplit(report$columns, split=',')[[1]]

##  [1] "zoneAveTemp"     "HVACmode"        "fan"             "outdoorTemp"    
##  [5] "outdoorHumidity" "sky"             "wind"            "zoneClimate"    
##  [9] "zoneCoolTemp"    "zoneHeatTemp"    "zoneHVACmode"    "zoneOccupancy"

Perhaps a lesser known feature of readr::read_csv() is that rather than a file, it can read a character vector where each element has comma-separated values and return a tibble.

library(magrittr)
library(readr)
report$reportList[[1]]$rowList %>% 
  unlist() %>% 
  # add date and time to the specified column names
  read_csv(col_names=c('date', 'time', strsplit(report$columns, split=',')[[1]]))

date	time	zoneAveTemp	HVACmode	fan	outdoorTemp	outdoorHumidity	sky	zoneClimate	zoneCoolTemp	zoneHeatTemp	zoneHVACmode
2021-01-25	00:00:00	70.5	heat	75	28.4	43	5	Sleep	71	71	heatOff
2021-01-25	00:05:00	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
2021-01-25	00:10:00	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
2021-01-25	00:15:00	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
2021-01-25	00:20:00	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On

We repeat that for each element in report$reportList and we have a nice tibble with the data we need.

library(purrr)

## 
## Attaching package: 'purrr'

## The following object is masked from 'package:magrittr':
## 
##     set_names

# using the thermostat IDs as names let's them be identified in the tibble
names(report$reportList) <- purrr::map_chr(report$reportList, 'thermostatIdentifier')
central_thermostat_info <- report$reportList %>% 
  map_df(
    ~ read_csv(unlist(.x$rowList), col_names=c('date', 'time', strsplit(report$columns, split=',')[[1]])), 
    .id='Thermostat'
  )
central_thermostat_info

Thermostat	date	time	zoneAveTemp	HVACmode	fan	outdoorTemp	outdoorHumidity	sky	zoneClimate	zoneCoolTemp	zoneHeatTemp	zoneHVACmode
28716	2021-01-25	00:00:00	70.5	heat	75	28.4	43	5	Sleep	71	71	heatOff
28716	2021-01-25	00:05:00	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
28716	2021-01-25	00:10:00	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
28716	2021-01-25	00:15:00	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
28716	2021-01-25	00:20:00	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
17611	2021-01-25	00:00:00	64.4	heat	135	28.4	43	5	Sleep	78	64	heatOff
17611	2021-01-25	00:05:00	64.2	heat	0	28.4	43	5	Sleep	78	64	heatOff
17611	2021-01-25	00:10:00	64.0	heat	0	28.4	43	5	Sleep	78	64	heatOff
17611	2021-01-25	00:15:00	63.9	heat	0	28.4	43	5	Sleep	78	64	heatOff
17611	2021-01-25	00:20:00	63.7	heat	0	28.4	43	5	Sleep	78	64	heatOff

Sensor Level Data

Due to the way the sensor data is stored they are more difficult to extract than the thermostat data.

First, we get the column names for the sensors. The problem is, this varies depending on how many sensors are associated with each thermostat and also the type of sensor. As of now there are three types: the thermostat itself, which measures occupancy, temperature and humidity and two kinds of remote sensors, both of which measure occupancy and temperature.

To do this we write two functions, one relates a sensor ID to a sensor name and the other joins in the result of the first function to match the column names listed in the columns element of each element in the sensorList part of the report.

relate_sensor_id_to_name <- function(sensorInfo)
{
  purrr::map_df(
    sensorInfo, 
    ~tibble::tibble(ID=.x$sensorId, name=glue::glue('{.x$sensorName}_{.x$sensorType}'))
  )
}

# see how it works
relate_sensor_id_to_name(report$sensorList[[1]]$sensors)

ID	name
rs:100:2	Bedroom 1_occupancy
rs:101:1	Master Bedroom_temperature
rs:101:2	Master Bedroom_occupancy
rs2:100:1	Office 1_temperature
rs2:100:2	Office 1_occupancy
rs2:101:1	Office 2_temperature
rs:100:1	Bedroom 1_temperature
rs2:101:2	Office 2_occupancy
ei:0:1	Thermostat Temperature_temperature
ei:0:2	Thermostat Humidity_humidity
ei:0:3	Thermostat Motion_occupancy

make_sensor_column_names <- function(sensorInfo)
{
  sensorInfo$columns %>% 
    unlist() %>% 
    tibble::enframe(name='index', value='id') %>% 
    dplyr::left_join(relate_sensor_id_to_name(sensorInfo$sensors), by=c('id'='ID')) %>% 
    dplyr::mutate(name=as.character(name)) %>% 
    dplyr::mutate(name=dplyr::if_else(is.na(name), id, name))
}

# see how it works
make_sensor_column_names(report$sensorList[[1]])

index	id	name
1	date	date
2	time	time
3	rs:100:2	Bedroom 1_occupancy
4	rs:101:1	Master Bedroom_temperature
5	rs:101:2	Master Bedroom_occupancy
6	rs2:100:1	Office 1_temperature
7	rs2:100:2	Office 1_occupancy
8	rs2:101:1	Office 2_temperature
9	rs:100:1	Bedroom 1_temperature
10	rs2:101:2	Office 2_occupancy
11	ei:0:1	Thermostat Temperature_temperature
12	ei:0:2	Thermostat Humidity_humidity
13	ei:0:3	Thermostat Motion_occupancy

Then for a set of sensors we can read the data from the data element using the read_csv() trick we saw earlier. Some manipulation is needed to we pivot the data longer, keep certain rows, break apart a column using tidyr::separate() make some changes with dplyr::mutate() then pivot wider.This results in a tibble, where each row represents the occupancy and temperature reading for a particular sensor and a given five-minute increment.

extract_one_sensor_info <- function(sensor)
{
  sensor_col_names <- make_sensor_column_names(sensor)$name
  sensor$data %>% 
    unlist() %>% 
    readr::read_csv(col_names=sensor_col_names) %>% 
    # make it longer so we can easy remove rows based on a condition
    tidyr::pivot_longer(cols=c(-date, -time), names_to='Sensor', values_to='Reading') %>% 
    # we use slice because grep() returns a vector of indices, not TRUE/FALSE
    dplyr::slice(grep(pattern='_(temperature)|(occupancy)$', x=Sensor, ignore.case=FALSE)) %>% 
    # split apart the sensor name from what it's measuring
    tidyr::separate(col=Sensor, into=c('Sensor', 'Measure'), sep='_', remove=TRUE) %>% 
    # rename the actual thermostats to say thermostat
    dplyr::mutate(Sensor=sub(pattern='Thermostat .+$', replacement='Thermostat', x=Sensor)) %>% 
    # back into wide format so each sensor is its own column
    tidyr::pivot_wider(names_from=Measure, values_from=Reading)
}

# see how it works
extract_one_sensor_info(report$sensorList[[1]])

date	time	Sensor	occupancy	temperature
2021-01-25	00:00:00	Bedroom 1	0	69.5
2021-01-25	00:00:00	Master Bedroom	1	71.6
2021-01-25	00:00:00	Office 1	1	72.7
2021-01-25	00:00:00	Office 2	0	74.9
2021-01-25	00:00:00	Thermostat	0	65.8
2021-01-25	00:05:00	Bedroom 1	0	69.3
2021-01-25	00:05:00	Master Bedroom	1	71.3
2021-01-25	00:05:00	Office 1	0	72.5
2021-01-25	00:05:00	Office 2	0	74.8
2021-01-25	00:05:00	Thermostat	0	65.7
2021-01-25	00:10:00	Bedroom 1	0	69.2
2021-01-25	00:10:00	Master Bedroom	0	71.2
2021-01-25	00:10:00	Office 1	0	72.5
2021-01-25	00:10:00	Office 2	0	74.8
2021-01-25	00:10:00	Thermostat	0	65.6
2021-01-25	00:15:00	Bedroom 1	0	69.2
2021-01-25	00:15:00	Master Bedroom	1	71.2
2021-01-25	00:15:00	Office 1	0	72.5
2021-01-25	00:15:00	Office 2	0	74.8
2021-01-25	00:15:00	Thermostat	0	65.5
2021-01-25	00:20:00	Bedroom 1	0	69.4
2021-01-25	00:20:00	Master Bedroom	1	71.2
2021-01-25	00:20:00	Office 1	0	72.7
2021-01-25	00:20:00	Office 2	0	74.9
2021-01-25	00:20:00	Thermostat	0	65.4

Then we put it all together and iterate over the sets of sensors attached to each thermostat.

extract_sensor_info <- function(report)
{
    names(report$sensorList) <- purrr::map_chr(report$reportList, 'thermostatIdentifier')
    purrr::map_df(report$sensorList, extract_one_sensor_info, .id='Thermostat')
}

We now save all of the sensor information in sensor_info.

sensor_info <- extract_sensor_info(report)

Combining Sensor and Thermostat Data

To nicely align the thermostat overall readings and settings with the individual sensor readings, we join the two together using dplyr::left_join() with the Thermostat, date and time columns as keys. We also want to join in the nice thermostat names based on the thermostat IDs, so that tibble will need to be created first.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

thermostat_names <- tibble(
  ID=map_chr(thermostat_info$thermostatList, 'identifier')
  , Name=map_chr(thermostat_info$thermostatList, 'name')
)

all_info <- inner_join(x=central_thermostat_info, y=sensor_info, by=c('Thermostat', 'date', 'time')) %>% 
  left_join(thermostat_names, by=c('Thermostat'='ID')) %>% 
  mutate(Sensor=if_else(Sensor=='Thermostat', glue::glue('{Name} Thermostat'), Sensor)) %>% 
  relocate(Name, Sensor, date, time, temperature, occupancy)

# see it
all_info

Name	Sensor	date	time	temperature	occupancy	Thermostat	zoneAveTemp	HVACmode	fan	outdoorTemp	outdoorHumidity	sky	zoneClimate	zoneCoolTemp	zoneHeatTemp	zoneHVACmode
Upstairs	Bedroom 1	2021-01-25	00:00:00	69.5	0	28716	70.5	heat	75	28.4	43	5	Sleep	71	71	heatOff
Upstairs	Master Bedroom	2021-01-25	00:00:00	71.6	1	28716	70.5	heat	75	28.4	43	5	Sleep	71	71	heatOff
Upstairs	Office 1	2021-01-25	00:00:00	72.7	1	28716	70.5	heat	75	28.4	43	5	Sleep	71	71	heatOff
Upstairs	Office 2	2021-01-25	00:00:00	74.9	0	28716	70.5	heat	75	28.4	43	5	Sleep	71	71	heatOff
Upstairs	Upstairs Thermostat	2021-01-25	00:00:00	65.8	0	28716	70.5	heat	75	28.4	43	5	Sleep	71	71	heatOff
Upstairs	Bedroom 1	2021-01-25	00:05:00	69.3	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Master Bedroom	2021-01-25	00:05:00	71.3	1	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 1	2021-01-25	00:05:00	72.5	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 2	2021-01-25	00:05:00	74.8	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Upstairs Thermostat	2021-01-25	00:05:00	65.7	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Bedroom 1	2021-01-25	00:10:00	69.2	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Master Bedroom	2021-01-25	00:10:00	71.2	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 1	2021-01-25	00:10:00	72.5	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 2	2021-01-25	00:10:00	74.8	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Upstairs Thermostat	2021-01-25	00:10:00	65.6	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Bedroom 1	2021-01-25	00:15:00	69.2	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Master Bedroom	2021-01-25	00:15:00	71.2	1	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 1	2021-01-25	00:15:00	72.5	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 2	2021-01-25	00:15:00	74.8	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Upstairs Thermostat	2021-01-25	00:15:00	65.5	0	28716	70.2	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Bedroom 1	2021-01-25	00:20:00	69.4	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Master Bedroom	2021-01-25	00:20:00	71.2	1	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 1	2021-01-25	00:20:00	72.7	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Office 2	2021-01-25	00:20:00	74.9	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Upstairs	Upstairs Thermostat	2021-01-25	00:20:00	65.4	0	28716	70.3	heat	300	28.4	43	5	Sleep	71	71	heatStage1On
Downstairs	Living Room	2021-01-25	00:00:00	64.4	0	17611	64.4	heat	135	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Playroom	2021-01-25	00:00:00	63.7	0	17611	64.4	heat	135	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Kitchen	2021-01-25	00:00:00	66.3	0	17611	64.4	heat	135	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Downstairs Thermostat	2021-01-25	00:00:00	65.1	0	17611	64.4	heat	135	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Living Room	2021-01-25	00:05:00	64.2	0	17611	64.2	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Playroom	2021-01-25	00:05:00	63.4	0	17611	64.2	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Kitchen	2021-01-25	00:05:00	65.8	0	17611	64.2	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Downstairs Thermostat	2021-01-25	00:05:00	64.8	0	17611	64.2	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Living Room	2021-01-25	00:10:00	63.9	0	17611	64.0	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Playroom	2021-01-25	00:10:00	63.2	0	17611	64.0	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Kitchen	2021-01-25	00:10:00	65.7	0	17611	64.0	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Downstairs Thermostat	2021-01-25	00:10:00	64.4	0	17611	64.0	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Living Room	2021-01-25	00:15:00	63.7	0	17611	63.9	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Playroom	2021-01-25	00:15:00	62.9	0	17611	63.9	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Kitchen	2021-01-25	00:15:00	65.6	0	17611	63.9	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Downstairs Thermostat	2021-01-25	00:15:00	64.1	0	17611	63.9	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Living Room	2021-01-25	00:20:00	63.4	0	17611	63.7	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Playroom	2021-01-25	00:20:00	62.6	0	17611	63.7	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Kitchen	2021-01-25	00:20:00	65.4	0	17611	63.7	heat	0	28.4	43	5	Sleep	78	64	heatOff
Downstairs	Downstairs Thermostat	2021-01-25	00:20:00	63.9	0	17611	63.7	heat	0	28.4	43	5	Sleep	78	64	heatOff

We now have a nice tibble in long format where each row shows a number of measurements and setting for a sensor for a given period of time.

Saving Data in the Cloud

Now that we have the data we need to save it somewhere. Since these data should be more resilient, cloud storage seems like the best option. There are many services including Azure, AWS, BackBlaze and GCP. But I prefer to use DigitalOcean Spaces since it has an easy interface and can take advantage of standard S3 APIs.

DigitalOcean uses slightly different terminology than AWS, in that buckets are called spaces. Otherwise we still need to deal with Access Key IDs, Secret Access Keys, Regions and Endpoints. A good tutorial on creating DigitalOcean spaces comes directly from them. In fact, they have so many great tutorials about Linux in general, I highly recommend using them for learning so much about computing.

While there is a DigitalOcean package, aptly named {analogsea}, I did not have luck getting it to work, so instead I used {aws.s3}.

After writing the all_info tibble to a CSV, we put it to the DigitalOcean bucket using aws.s3::put_object(). This requires three arguments:

file: The name of the file on our computer.
object: The path including filename inside the S3 bucket where the file will be saved.
bucket: The name of the S3 bucket, or in our case the space (DigitalOcean calls buckets spaces).

Implicitly, put_object() and most of the functions in {aws.s3} depend on certain environment variables being set. The first two are AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY which correspond to the DigitalOcean Access Key and Secret Key, respectively. These can be generated at https://cloud.digitalocean.com/account/api/tokens. The Access Key can be retrieved at any time, but the Secret Key is only displayed one time, so we must save it somewhere. Just like AWS, DigitalOcean has regions and this value is saved in the AWS_DEFAULT_REGION environment variable. In this case the region is nyc3. Then we need the AWS_S3_ENDPOINT environment variable. The DigitalOcean documentation makes it seem like this should be nyc3.digitaloceanspaces.com (or the appropriate region) but for our purposes it should just be digitaloceanspaces.com.

Perhaps the most common way to set environment variables in R is to save them in the VARIABLE_NAME=value format in the .Renviron file either in our home directory or project directory. But we must be sure NOT to track this file with git lest we publicly expose our secret information.

Then we can call the put_object() function.

# we make a file each day so it is important to include the date in the name.
filename <- 'all_info_2021-02-15.csv'
aws.s3::put_object(
  file=filename, object=sprintf('%s/%s', 'do_folder_name', filename), bucket='do_bucket_name'
)

After this, the file lives in a DigitalOcean space to be accessed later.

Putting it All Together with `{targets}`

There are a lot of moving parts to this whole process, even more so than displayed here. They could have all been put into a script but that tends to be fragile, so instead we use {targets}. This package, which is the eventual replacement for {drake}, builds a robust pipeline that schedules jobs to be run based on which jobs depend on others. It is intelligent in that it only runs jobs that are out of date (usually meaning neither the code nor data have been modified) and can run jobs in parallel. The author, Will Landau, gave an excellent talk about this at the New York Open Statistical Programming Meetup.

In order to use {targets}, we need to have a file named _targets.R in the root of our project. In there we have a list of targets, or jobs, each defined in a tar_target() function (or tar_force(), tar_change() or similar function). The first argument to tar_target() is the name of the target and the next argument is an expression, usually a function call, whose value is saved as the name.

A small example would be a target to get the access token, a target for getting thermostat information (which requires the access token), another to extract the thermostat IDs from that information and a last target to get the report based on the IDs and token.

library(targets)
list(
  tar_target(
    access_token,
    httr::POST(
      url=glue::glue("https://api.ecobee.com/token?grant_type=refresh_token&code={Sys.getenv('ECOBEE_REFRESH_TOKEN')}&client_id={Sys.getenv('ECOBEE_API_KEY')}"), 
      encode='json'
    )
  )
  , tar_target(
    thermostat_info,
    httr::GET(
      'https://api.ecobee.com/1/thermostat?format=json&body={"selection":{"selectionType":"registered","selectionMatch":"","includeLocation":true}}'
      , httr::add_headers(Authorization=sprintf('Bearer %s', access_token))
      , encode='json'
    ) %>% 
      httr::content()
  )
  , tar_target(
    thermostat_ids,
    thermostat_info$thermostatList %>% 
      purrr::map_chr('identifier') %>% 
      paste(collapse=',')
  )
  , tar_target(
    report,
    httr::GET(
        sprintf(
            'https://api.ecobee.com/1/runtimeReport?format=json&body={"startDate":"2021-01-26","endDate":"2021-01-26","startInterval":60,"endInterval":59,"columns":"zoneAveTemp,hvacMode,fan,outdoorTemp,outdoorHumidity,sky,wind,zoneClimate,zoneCoolTemp,zoneHeatTemp,zoneHvacMode,zoneOccupancy","selection":{"selectionType":"thermostats","selectionMatch":"%s"},"includeSensors":true}',
            thermostat_ids
        )
        , httr::add_headers(Authorization=sprintf('Bearer %s', access_token))
        , encode='json'
    ) %>% 
        httr::content()
  )
)

With the _targets.R file in the root of the directory, we can visualize the steps with tar_visnetwork().

tar_visnetwork()

To execute the jobs we use tar_make().

tar_make()

The results of individual jobs, known as targets, can be loaded into the working session with tar_load(target_name) or assigned to a variable with variable_name <- tar_read(target_name).

Each time this is run the appropriate file in the DigitalOcean space either gets written for the first time, or overwritten.

The actual set of targets for this project were much more involved and can be found on GitHub. Each target is the result of a custom function, which allows for more thorough documentation and testing. The version on GitHub even allows us to run the workflow for different dates so we can backfill data if needed.

Please note that anywhere there is potentially sensitive information such as the refresh token or API key, those were saved in environment variables, then accessed via Sys.getenv(). It is very important not to check the .Renviron files, where environment variables are stored, into git. They should be treated like passwords.

What’s Next?

Now that we have a functioning workflow that we can call anytime, the question becomes how do we run this on a regular basis? We could set up a cron job on a server that requires the server to always be up and we need to maintain this process. We could use scheduled lambda functions, but that’s also a lot of work. Instead, we’ll use GitHub Actions to run this workflow on a scheduled basis. We’ll go over that in the next blog post.

My team at Lander Analytics has been putting together conferences for six years, and they’ve always had the same fun format, which the community has really enjoyed. There’s the NYR conference for New Yorkers and those who want to fly, drive or train to join the New York community, and there’s DCR, which gathers the DC-area community. The last DCR Conference at Georgetown University went really well, as you can see in this recap. With the shift to virtual gatherings brought on by the pandemic, our community has gone fully remote, including the monthly Open Statistical Programming Meetup. With that, we realized the DCR Conference didn’t just need to be for folks from the DC-area anymore, instead, we could welcome a global audience like we did with this year’s NYR. And that gave birth to R|Gov, the Government and Public Sector R Conference.

R|Gov is really a new industry-focused conference. Instead of drawing on speakers from a particular city or area, the talks will focus on work done in specific fields. In this case, in government, defense, NGOs and the public sector, and we have speakers from not only the DC-area, but also from Geneva, Switzerland, Nashville, Tennessee, Quebec, Canada and Los Angeles, California. For the last three years, we have been working with Data Community DC, R-Ladies DC, and the Statistical Programming DC Meetup, to put on DCR, and continue to do so for R|Gov as we find great speakers and organizations who want to collaborate in driving attendance and building the community.

Like NYR and DCR, the topics at R|Gov range from practical how-tos, to theoretical findings, to processes, to tooling and the speakers this year come from the Center for Army Analysis, NASA, Columbia University, The U.S. Bureau of Labor Statistics, the Inter-American Development Bank, The United States Census Bureau, Harvard Business School, In-Q-Tel, Virginia Tech, Deloitte, NYC Department of Health and Mental Hygiene and Georgetown University, among others. We will also be hosting two rum and gin master classes, including one with Mount Gay, which comes from the oldest continuously running rum distillery in the world, and which George Washington served at his inauguration!

The R Conference series is quite a bit different from other industry and academic conferences. The talks are twenty minutes long with no audience questions with the exception of special talks from the likes of Andrew Gelman or Hadley Wickham. Whether in person or virtual, we play music, have prize giveaways and involve food in the programming. When they were in person, we prided ourselves on avocado toast, pizza, ice cream and beer. For prizewinners, we autographed books right on stage since the authors were either speakers or in the audience. With the virtual events we try to capture as much of that spirit as possible, and the community really enjoyed the virtual R Conference | NY in August. A very lively event remotely and in the flesh, it is also one of the more informative conferences I have ever seen.

This virtual conference will include much of the in-person format, just recreated virtually. We will have 24 talks, a panel, workshops, community and networking breaks, happy hours, prizes and giveaways, a Twitter Contest, Meet the Speaker series, Job Board access, and participation in the Art Auction. We hope to see you there December 2-4, on a comfy couch near you.

Contact my team if you would like to sponsor. To learn more about the speaker lineup, workshops, and agenda visit rstats.ai/gov and landeranalytics.com. Join our slack team and follow us at @rstatsdc and @landeranalytics.

The sixth annual (and first virtual) “New York” R Conference took place August 5-6 & 12-15. Almost 300 attendees, and 30 speakers, plus a stand-up comedian and a whiskey masterclass leader, gathered remotely to explore, share, and inspire ideas.

We had many awesome speakers, many new and some, returning: Dr. Rob J Hyndman (Monash University), Dr. Adam Obeng (Facebook), Ludmila Janda (Amplify), Emily Robinson (Warby Parker), Daniel Chen (Virginia Tech, Lander Analytics), Dr. Jon Krohn (untapt), Dr. Andrew Gelman (Columbia University), David Smith (Microsoft), Laura Gabrysiak (Visa), Brooke Watson (ACLU), Dr. Sebastian Teran Hidalgo (Vroom), Catherine Zhou (Codecademy), Dr. Jacqueline Nolis (Brightloom), Sonia Ang (Microsoft), Emily Dodwell (AT&T Labs Research), Jonah Gabry (Columbia University, Stan Development Team), Wes MckKinney and Dr. Neal Richardson (Ursa Labs), Dr. Thomas Mock (RStudio), Dr. David Robinson, (Heap), Dr. Max Kuhn (RStudio), Dr. Erin LeDell (H2O.ai), Monica Thieu (Columbia University), Camelia Hssaine (Codecademy), and myself and, coming soon, a bonus talk by Heather Nolis (T-Mobile) which will be shared on YouTube as soon as our team is done editing them, along with all the other talks.

Let’s take a look at some of the highlights from the conference:

Andrew Gelman Gave Another 40-Minute Talk (no slides, as always)

Our favorite quotes from Andrew Gelman’s talk, Truly Open Science: From Design and Data Collection to Analysis and Decision Making, which had no slides, as usual:

“Everyone training in statistics becomes a teacher.”

“The most important thing you should take away — put multiple graphs on a page.”

“Honesty and transparency are not enough.”

“Bad science doesn’t make someone a bad person.”

Laura Gabrysiak Shows us We Are Driven By Experience, and not Brand Loyalty…Hope you Folks had a Good Experience!

Laura’s talk on re-Inventing customer engagement with machine learning went through several interesting use cases from her time at Visa. In addition to being a data scientist, she is an active community organizer and the co-founder of R-Ladies Miami.

Adam Obeng Delivered a Talk on Adaptive Experimentation

One of my former students at Columbia University, Adam Obeng, gave a great presentation on his adaptive experimentation. We learned that adaptive experimentation is three things: The name of (1) a family of techniques, (2) Adam’s team at Facebook, and (3) an open source package produced by said team. He went through the applications which are hyper-parameter optimization for ML, experimentation with multiple continuous treatments, and physical experiments or manufacturing.

Dr. Jacqueline Nolis Invited Us to Crash Her Viral Website, Tweet Mashup

Jacqueline asked the crowd to crash her viral website,Tweet Mashup, and gave a great talk on her experience building it back in 2016. Her website that lets you combine the tweets of two different people. After spending a year making it in .NET, when she launched the site it became an immediate sensation. Years later, she was getting more and more frustrated maintaining the F# code and decided to see if I could recreate it in Shiny. Doing so would require having Shiny integrate with the Twitter API in ways that hadn’t been done by anyone before, and pushing the Twitter API beyond normal use cases.

Attendees Participated in Two Virtual Happy Hours Packed with Fun

At the Friday Happy Hour, we had a mathematical standup comedian for the first time in R Conference history. Comic and math major Rachel Lander (no relationship to me!) entertained us with awesome math and stats jokes.

Following the stand up, we had a Whiskey Master Class with our Vibe Sponsor Westland Distillery, and another one on Saturday with Bruichladdich Distillery (hard to pronounce and easy to drink). Attendees and speakers learned and drank together, whether it be their whiskey, matchas, soda or water.

All Proceeds from the A(R)T Auction went to the R Foundation Again

A newer tradition, the A(R)T Auction, took place again! We featured pieces by artists in the R Community, and all proceeds were donated to the R Foundation. The highest-selling piece at auction was Street Cred (2020) by Vivian Peng (Lander Analytics and Los Angeles Mayor’s Office, Innovation Team). The second highest was a piece by Jacqueline Nolis (Brightloom, and Build a Career in Data Science co-author), R Conference speaker, Designed by Allison Horst, artist in residence at RStudio.

The R-Ladies Group Photo Happened, Even Remotely!

As per tradition, we took an R-Ladies group photo, but, for the first time, remotely– as a screenshot! We would like to note that many more R-Ladies were present in the chat, but just chose not to share video.

Jon Harmon, Edna Mwenda, and Jessica Streeter win Raspberri Pis, Bluetooth Headphones, and Tenkeyless Keyboards for Most Active Tweeting During the Conference

This year’s Twitter Contest, in Malorie’s words, was a “ruthless but noble war.” You can see the NYR 2020 Dashboard here. A custom started that DCR 2018 by our Twitter scorekeeper Malorie Hughes (@data_all_day) has returned every year by popular demand, and now she’s stuck with it forever! Congratulations to our winners!

50+ Conference Attendees Participated in Pre-Conference Workshops Before

For the first time ever, workshops took place over the course of several days to promote work-life balance, and to give attendees the chance to take more than one course. We ran the following seven workshops:

Machine Learning in R with Max Kuhn
Machine learning in Python with scikit-learn with Andreas Mueller

Data Visualization with Shiny Workshop with Vivian Peng
Geospatial Statistics and Mapping in R with Kaz Sakamoto
Git for Data Science with Daniel Chen

Exploring Data with the Tidyverse with David Robinson
Web Scraping with Selenium with Amanda Dobbyn

Recreating the In-Person Experience

We recreated as much of the in-person experience as possible with attendee networking sessions, the speaker walk-on songs and fun facts, abundant prizes and giveaways, the Twitter contest, an art auction, and happy hours. In addition to all of this, we mailed conference programs, hex stickers, and other swag to each attendee (in the U.S.), along with discount codes from our Vibe Sponsors, MatchaBar, Westland Distillery and Bruichladdich Distillery.

Thank you, Lander Analytics Team!

Even though it was virtual, there was a lot of work that went into the conference, and I want to thank my amazing team at Lander Analytics along with our producer, Bill Prickett, for making it all come together.

Looking Forward to D.C. and Dublin
If you attended, we hope you had an incredible experience. If you did not, we hope to see you at the virtual DC R Conference in the fall, and at the first Dublin R Conference and the NYR next year!

Earlier this week, my company, Lander Analytics, organized our first public Bayesian short course, taught by Andrew Gelman, Bob Carpenter and Daniel Lee. Needless to say the class sold out very quickly and left a long wait list. So we will schedule another public training (exactly when tbd) and will make the same course available for private training.

This was the first time we utilized three instructors (as opposed to a main instructor and assistants which we often use for large classes) and it led to an amazing dynamic. Bob laid the theoretical foundation for Markov chain Monte Carlo (MCMC), explaining both with math and geometry, and discussed the computational considerations of performing simulation draws. Daniel led the participants through hands-on examples with Stan, covering everything from how to describe a model, to efficient computation to debugging. Andrew gave his usual, crowd dazzling performance use previous work as case studies of when and how to use Bayesian methods.

It was an intensive three days of training with an incredible amount of information. Everyone walked away knowing a lot more about Bayes, MCMC and Stan and eager to try out their new skills, and an autographed copy of Andrew’s book, BDA3.

A big help, as always was Daniel Chen who put in so much effort making the class run smoothly from securing the space, physically moving furniture and running all the technology.

[Show slideshow]

$Gelman Math$

Attending this week’s Strata conference it was easy to see quite how prolific the NYC Data Mafia is when it comes to writing. Some of the found books:

Books from the #nycdatamafia @drewconway @johnmyleswhite http://t.co/EuV4FF6JA7 pic.twitter.com/Oi8tVcjPYE

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @mikedewar http://t.co/w2oCS2jLvN pic.twitter.com/yiq9x6SG3y

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @wesmckinn http://t.co/jhUPSrtTOE pic.twitter.com/ri5eUhWwY0

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia Rachel Schutt @mathbabedotorg http://t.co/EVI6HanjUb pic.twitter.com/yTL0fXQGBK

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @HarlanH @wahalulu http://t.co/6CjAvGsHRL pic.twitter.com/0DwMqSmNve

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @qethanm http://t.co/Hy82gz4tGe pic.twitter.com/Uba15XIhLT

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @pauldix http://t.co/Tdw0MSF5B7 pic.twitter.com/4rmpk5UuYf

— NYC Data Hackers (@nyhackr) October 29, 2013

And, of course, my book will be out soon to join them.

For the past few weeks Time Out New York‘s Dating columnist, Jamie Bufalino, has been fielding letters discussing the ratio of homosexual to heterosexual questions he answers. The readers suggested that disproportionate attention is paid to Gay and Lesbian issues compared to the Gay and Lesbian proportion of the general population.

Jamie rudely called his readers “ass-wipes” and repeatedly told them to “remove your head from your ass.” He also professed to have “no idea what the percentage is of gay/bi versus straight issues that end up in the column.”

One question and response:

Q I see statistics that show NYC to be 6 percent gay, lesbian and bi, and yet in “Get Naked” you feature letters from them almost to the exclusion of heteros. Why the preoccupation with them in your column? It doesn’t seem right or logical. As one of the other 94 percent, I am disappointed and offended weekly.

A All I can say is: You’ve got your head up your butt. Just in the past month or so, I’ve answered letters from a straight guy with a weird fetish that suddenly stopped delivering the jollies it used to, a straight guy who was juggling a woman from the Ukraine and a woman from Jersey, a woman who had an issue with sticking her finger up her boyfriend’s butt, a 19-year-old woman who was getting pressured to have sex with her boyfriend, and on and on. If, for some reason, you happen to be obsessing over the gay and bi questions and not acknowledging the straight ones, that’s your issue, not mine.

And another:

Q I always read your column to see if I can learn something and just for shits and giggles. The one thing that has always bothered me is your preoccupation with gay and bi problems. Gays and lesbians get their own special section of three to four pages!

A First of all, dude, you sound like one of those total ass-wipes who believes that gay people somehow have all these special privileges that straight people aren’t entitled to. Honestly, I have no idea what the percentage is of gay/bi versus straight issues that end up in the column, because it doesn’t matter. If you removed your head from your ass, you’d realize that so many sexual issues are universal and that you can learn something from all sorts of people who don’t fit into your specific demographic.

When confronted with the data he once again reffered to a “head lodged up [a] rectum” and suggested the reader was “paranoid.”

Q As a statistician I am disappointed by your response to a question in the November 4 issue [TONY 788]. The reader wrote, “I see statistics that show NYC to be 6 percent gay, lesbian and bi, and yet in ‘Get Naked’ you feature letters from them almost to the exclusion of heteros. Why the preoccupation with them in your column? … As one of the other 94 percent, I am disappointed and offended weekly.” You responded by citing individual examples of heterosexual questions you’ve fielded, which is not a valid form of proof. I went through about ten months’ worth of “Get Naked” columns on the TONY website and found that approximately 19 percent of the questions were from gay (15 percent) or lesbian (4 percent) readers. Whether or not that percentage is representative of the general population is not my concern. I just feel that Jamie should have his data correct and not write, “You’ve got your head up your butt.”

A I seriously cannot believe I am still getting letters about this. Okay, Mr. Disappointed Statistician: If you don’t want to come off as someone who has his head lodged up his rectum, it would be an awesome idea not to leap to the defense of some jackass who claims I cater to homo letters “almost to the exclusion of heteros” and then point out that straight issues actually make up a full 81 percent of the subject matter here in “Get Naked.” What I want to know is, why are you even keeping score? Are you really that insecure about the amount of attention heterosexual sex gets in the media? If so, that’s both laughable and sad. This is the last time I’m addressing this, so here’s my final bit of advice to you (and your like-minded brethren): Stop being so paranoid.

Since Jamie is so rude to his readers and clearly doesn’t have any sense of the data, I thought I’d take a look at the numbers. Results after the break.

Continue reading →

Last week Slice ran a post about a tomato taste test they conducted with Scott Wiener (of Scott’s NYC Pizza Tours), Brooks Jones, Jason Feirman, Nick Sherman and Roberto Caporuscio from Keste. While the methods used may not be rigorous enough for definitive results, I took the summary data that was in the post and performed some simple analyses.

The first thing to note is that there are only 16 data points, so multiple regression is not an option. We can all thank the Curse of Dimensionality for that. So I stuck to simpler methods and visualizations. If I can get the raw data from Slice, I can get a little more advanced.

For the sake of simplicity I removed the tomatoes from Eataly because their price was such an outlier that it made visualizing the data difficult. As usual, most of the graphics were made using ggplot2 by Hadley Wickham. The coefficient plots were made using a little function I wrote. Here is the code. Any suggestions for improvement are greatly appreciated, especially if you can help with increasing the left hand margin of the plot. And as always, all the work was done in R.

The most obvious relationship we want to test is Overall Quality vs. Price. As can be seen from the scatterplot below with a fitted loess curve, there is not a linear relationship between price and quality.

More after the break. Continue reading →

Getting the Data

Preparing the Data

Skip All of This

Visualizing with Interactive Plots

Computing Cross Correlations

Fitting Time Series Models with {fable}

Running Everything with {targets}

Reporting with RStudio Connect

What Comes Next?

Related Posts

Docker Image

GitHub Actions

End of Day

What’s Next?

Related Posts

Accessing the API

Handling the Data

Thermostat Level Data

Sensor Level Data

Combining Sensor and Thermostat Data

Saving Data in the Cloud

Putting it All Together with {targets}

What’s Next?

Related Posts

Related Posts

Related Posts

Related Posts

Related Posts

Related Posts

Related Posts

Related Posts

Fitting Time Series Models with `{fable}`

Running Everything with `{targets}`

Putting it All Together with `{targets}`