In previous posts I talked about collecting temperature data from different rooms in my house and automating that process using Docker and GitHub Actions. Now it is time to analyze that data to figure out what is happening in the house, make informed decisions about replacing the HVAC units, and find ways to make the house more comfortable for everyone until then.
During this process I pull the data into R from a DigitalOcean space (like an AWS S3 bucket) using {arrow}, manipulate the data with {dplyr}, {tidyr} and {tsibble}, visualize the data with {timetk}, fit models with {fable} and visualize the models with {coefplot}.
Getting the Data
As discussed in the post about tracking the data, we created a CSV for each day of data in a DigitalOcean space. The CSV tracks room temperatures, thermostat settings, outdoor temperature, etc., for every five-minute interval during each day.
In the past, if multiple files had to be read into a single data.frame or tibble, the best course of action would have been to combine map_df() from {purrr} with read_csv() or fread() from {readr} and {data.table}, respectively. But as seen in Wes McKinney and Neal Richardson’s talk from the 2020 New York R Conference, using open_dataset() from {arrow} then collecting the data with dplyr::collect() should be faster. An added bonus is that {arrow} has functionality built in to pull from an S3 bucket, and that includes DigitalOcean spaces.
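For comparison, the older multi-file pattern might have looked something like this sketch (the folder path here is hypothetical):

library(purrr)
library(readr)
# list every daily CSV, then read and row-bind them into one tibble
csv_files <- list.files('data/temperatures', pattern='\\.csv$', full.names=TRUE)
temps_old <- map_df(csv_files, read_csv)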
The first step is to use arrow::S3FileSystem$create() to make a reference to the S3 file system. Several pieces of information are needed for this:

access_key: The DigitalOcean Access Key created at https://cloud.digitalocean.com/account/api/tokens.
secret_key: The DigitalOcean Secret Key created at https://cloud.digitalocean.com/account/api/tokens.
scheme: This should just be "https".
endpoint_override: Since we are using DigitalOcean and not AWS, we need to change the default to something along the lines of "nyc3.digitaloceanspaces.com", depending on the region.

The Access Key can be retrieved at any time, but the Secret Key is only displayed one time, so we must save it somewhere.
This is all the same information used when writing the files to the DigitalOcean space in the first place. We also save it all in environment variables in the .Renviron file to avoid exposing this sensitive information in our code. It is very important not to check this file into git, to reduce the risk of exposing the private keys.
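Given the variable names used in the code below, the .Renviron file would contain entries along these lines (the values shown are placeholders):

AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
DO_REGION=nyc3
DO_BASE=digitaloceanspaces.com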
space <- arrow::S3FileSystem$create(
access_key=Sys.getenv('AWS_ACCESS_KEY_ID'),
secret_key=Sys.getenv('AWS_SECRET_ACCESS_KEY'),
scheme="https",
endpoint_override=glue::glue("{Sys.getenv('DO_REGION')}.{Sys.getenv('DO_BASE')}")
)
space
## S3FileSystem
The space object represents the S3 file system we have access to. Its elements are mostly file-system-type functions such as cd(), ls(), CopyFile() and GetFileInfo(), and they are accessed via space$function_name(). To see a listing of files in a folder inside a bucket we call space$ls() with the bucket and folder name as the argument, in quotes.
space$ls('bucket_name/folder_name')
## [1] "bucket_name/folder_name/2021-01-01.csv"
## [2] "bucket_name/folder_name/2021-01-02.csv"
## [3] "bucket_name/folder_name/2021-01-03.csv"
## [4] "bucket_name/folder_name/2021-01-04.csv"
## [5] "bucket_name/folder_name/2021-01-05.csv"
## [6] "bucket_name/folder_name/2021-01-06.csv"
## [7] "bucket_name/folder_name/2021-01-07.csv"
## [8] "bucket_name/folder_name/2021-01-08.csv"
## [9] "bucket_name/folder_name/2021-01-09.csv"
## [10] "bucket_name/folder_name/2021-01-10.csv"
## [11] "bucket_name/folder_name/2021-01-11.csv"
## [12] "bucket_name/folder_name/2021-01-12.csv"
## [13] "bucket_name/folder_name/2021-01-13.csv"
## [14] "bucket_name/folder_name/2021-01-14.csv"
## [15] "bucket_name/folder_name/2021-01-15.csv"
## [16] "bucket_name/folder_name/2021-01-16.csv"
## [17] "bucket_name/folder_name/2021-01-17.csv"
## [18] "bucket_name/folder_name/2021-01-18.csv"
## [19] "bucket_name/folder_name/2021-01-19.csv"
## [20] "bucket_name/folder_name/2021-01-20.csv"
## [21] "bucket_name/folder_name/2021-01-21.csv"
## [22] "bucket_name/folder_name/2021-01-22.csv"
## [23] "bucket_name/folder_name/2021-01-23.csv"
## [24] "bucket_name/folder_name/2021-01-24.csv"
## [25] "bucket_name/folder_name/2021-01-25.csv"
## [26] "bucket_name/folder_name/2021-01-26.csv"
## [27] "bucket_name/folder_name/2021-01-27.csv"
## [28] "bucket_name/folder_name/2021-01-28.csv"
## [29] "bucket_name/folder_name/2021-01-29.csv"
## [30] "bucket_name/folder_name/2021-01-30.csv"
## [31] "bucket_name/folder_name/2021-01-31.csv"
## [32] "bucket_name/folder_name/2021-02-01.csv"
## [33] "bucket_name/folder_name/2021-02-02.csv"
## [34] "bucket_name/folder_name/2021-02-03.csv"
## [35] "bucket_name/folder_name/2021-02-04.csv"
## [36] "bucket_name/folder_name/2021-02-05.csv"
## [37] "bucket_name/folder_name/2021-02-06.csv"
## [38] "bucket_name/folder_name/2021-02-07.csv"
## [39] "bucket_name/folder_name/2021-02-08.csv"
## [40] "bucket_name/folder_name/2021-02-09.csv"
## [41] "bucket_name/folder_name/2021-02-10.csv"
## [42] "bucket_name/folder_name/2021-02-11.csv"
## [43] "bucket_name/folder_name/2021-02-12.csv"
## [44] "bucket_name/folder_name/2021-02-13.csv"
## [45] "bucket_name/folder_name/2021-02-14.csv"
## [46] "bucket_name/folder_name/2021-02-15.csv"
## [47] "bucket_name/folder_name/2021-02-16.csv"
## [48] "bucket_name/folder_name/2021-02-17.csv"
## [49] "bucket_name/folder_name/2021-02-18.csv"
## [50] "bucket_name/folder_name/2021-02-19.csv"
## [51] "bucket_name/folder_name/2021-02-20.csv"
## [52] "bucket_name/folder_name/2021-02-21.csv"
## [53] "bucket_name/folder_name/2021-02-22.csv"
## [54] "bucket_name/folder_name/2021-02-23.csv"
To use open_dataset() we need a path to the folder holding the files. This is built with space$path(). The structure is "bucket_name/folder_name". For security, even the bucket and folder names are stored in environment variables. For this example, the real values have been substituted with "bucket_name" and "folder_name".
files_path <- space$path(glue::glue("{Sys.getenv('DO_SPACE')}/{Sys.getenv('DO_FOLDER')}"))
files_path
## SubTreeFileSystem: s3://bucket_name/folder_name/
Now we can call open_dataset() to get access to all the files in that folder. We need to specify format="csv" to let the function know the files are CSVs as opposed to parquet or other formats.
temps_dataset <- arrow::open_dataset(files_path, format='csv')
temps_dataset
## FileSystemDataset with 66 csv files
## Name: string
## Sensor: string
## date: date32[day]
## time: string
## temperature: double
## occupancy: int64
## Thermostat: int64
## zoneAveTemp: double
## HVACmode: string
## fan: int64
## outdoorTemp: double
## outdoorHumidity: int64
## sky: int64
## wind: int64
## zoneClimate: string
## zoneCoolTemp: double
## zoneHeatTemp: double
## zoneHVACmode: string
## zoneOccupancy: int64
Printing out the temps_dataset object shows the column names along with the type of data stored in each column. {arrow} is pretty great and lets us do column selection and row filtering on the data sitting in files, which opens up a whole world of analysis on data too large to fit in memory. We are simply going to select columns of interest and collect all the data into one data.frame (actually a tibble). Notice we call select() before collect() because this reduces the number of columns being transmitted over the network.
library(dplyr)
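# a sketch of the collection step: select() runs against the files on
# the space, then collect() pulls the result into memory; the dropped
# columns are an assumption based on the 16 columns shown below
temps_raw <- temps_dataset %>%
  select(-occupancy, -Thermostat, -zoneOccupancy) %>%
  collect()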
temps_raw %>% head()
## # A tibble: 6 x 16
## Name Sensor date time temperature fan zoneAveTemp HVACmode
## <chr> <chr> <date> <chr> <dbl> <int> <dbl> <chr>
## 1 Upst… Bedro… 2021-01-01 00:0… 67.8 300 70 heat
## 2 Upst… Bedro… 2021-01-01 00:0… 72.4 300 70 heat
## 3 Upst… Offic… 2021-01-01 00:0… NA 300 70 heat
## 4 Upst… Upsta… 2021-01-01 00:0… 63.1 300 70 heat
## 5 Upst… Offic… 2021-01-01 00:0… NA 300 70 heat
## 6 Upst… Offic… 2021-01-01 00:0… NA 300 70 heat
## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>,
## # wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>,
## # zoneHVACmode <chr>
Preparing the Data
The data are in need of some minor attention. First, the time column is read in as a character instead of a time. This is a known issue, so instead we combine it with the date column using tidyr::unite() then convert this combination into a datetime (or timestamp) with lubridate::as_datetime(), which requires us to set a time zone. Then, the sensors named Upstairs Thermostat and Downstairs Thermostat actually represent Bedroom 3 and Dining Room, respectively, so we rename those using case_when() from {dplyr}, leaving all the other sensor names as is. Then we only keep rows where the time is earlier than the present. This is due to a quirk of the Ecobee API where it can return future readings, which happens because we request all of the latest day’s data.
temps <- temps_raw %>%
tidyr::unite(col='time', date, time, sep=' ', remove=TRUE) %>%
mutate(time=lubridate::as_datetime(time, tz=Sys.timezone())) %>%
mutate(
Sensor=case_when(
Sensor == 'Upstairs Thermostat' ~ 'Bedroom 3',
Sensor == 'Downstairs Thermostat' ~ 'Dining Room',
TRUE ~ Sensor
)
) %>%
filter(time <= lubridate::now())
Skip All of This
Since the actual data may contain sensitive information, a sanitized version is stored as parquet files on a public DigitalOcean space. This dataset is not as up to date, but it will do for those who want to follow along.
publicspace <- arrow::S3FileSystem$create(
  # anonymous means we do not need to provide credentials
  # this is crucial, otherwise the function looks for credentials
  # and crashes R if it can't find them
  anonymous=TRUE,
  scheme="https",
  # the data are stored in the nyc3 region
  endpoint_override='nyc3.digitaloceanspaces.com'
)
publicspace$ls('temperaturedata', recursive=TRUE)
## [1] "temperaturedata/2021"
## [2] "temperaturedata/2021/1"
## [3] "temperaturedata/2021/1/part-0.parquet"
## [4] "temperaturedata/2021/2"
## [5] "temperaturedata/2021/2/part-1.parquet"
publictemp <- arrow::open_dataset(publicspace$path("temperaturedata")) %>%
collect()
publictemp %>% head()
## # A tibble: 6 x 15
## Name Sensor time temperature fan zoneAveTemp HVACmode
## <chr> <chr> <dttm> <dbl> <int> <dbl> <chr>
## 1 Upst… Bedro… 2021-01-01 00:00:00 67.8 300 70 heat
## 2 Upst… Bedro… 2021-01-01 00:00:00 72.4 300 70 heat
## 3 Upst… Offic… 2021-01-01 00:00:00 NA 300 70 heat
## 4 Upst… Bedro… 2021-01-01 00:00:00 63.1 300 70 heat
## 5 Upst… Offic… 2021-01-01 00:00:00 NA 300 70 heat
## 6 Upst… Offic… 2021-01-01 00:00:00 NA 300 70 heat
## # … with 8 more variables: outdoorTemp <dbl>, outdoorHumidity <int>, sky <int>,
## # wind <int>, zoneClimate <chr>, zoneHeatTemp <dbl>, zoneCoolTemp <dbl>,
## # zoneHVACmode <chr>
Visualizing with Interactive Plots
The first step in any good analysis is visualizing the data. In the past, I would have recommended {dygraphs}, but I have since come around to {feasts} from Rob Hyndman’s team and {timetk} from Matt Dancho due to their ease of use, better flexibility with data types and because I personally think they are more visually pleasing.
In order to show the outdoor temperature on the same footing as the sensor temperatures, we select the pertinent columns with the outdoor data from the temps tibble and stack them at the bottom of temps, after accounting for duplicates (each sensor has an outdoor reading, which is the same across sensors).
all_temps <- dplyr::bind_rows(
temps,
temps %>%
dplyr::select(Name, Sensor, time, temperature=outdoorTemp) %>%
dplyr::mutate(Sensor='Outdoors', Name='Outdoors') %>%
dplyr::distinct()
)
Then we use plot_time_series() from {timetk} to make a plot that has a different colored line for each sensor and for the outdoors, with sensors grouped into facets according to which thermostat they are associated with. Ordinarily the interactive version would be better (.interactive=TRUE below), but for the sake of this post the static version displays better.
library(timetk)
all_temps %>%
group_by(Name) %>%
plot_time_series(
.date_var=time,
.value=temperature, .color_var=Sensor, .smooth=FALSE,
.interactive=FALSE
)
## Warning: Removed 34776 row(s) containing missing values (geom_path).
From this we can see a few interesting patterns. Bedroom 1 is consistently higher than the other bedrooms, but lower than the offices, which are all on the same thermostat. All the downstairs rooms see their temperatures drop overnight when the thermostat is lowered to conserve energy. Between February 12 and 18 this dip was not as dramatic, due to running an experiment to see how raising the temperature on the downstairs thermostat would affect rooms on the second floor.
Computing Cross Correlations
One room in particular, Bedroom 2, always seemed colder than all the rest, so I wanted to figure out why. Was it affected by physically adjacent rooms? Or maybe it was controlled by the downstairs thermostat rather than the upstairs thermostat as presumed.

So the first step was to see which rooms were correlated with each other. But computing correlation with time series data isn’t as simple, because we need to account for lagged correlations. This is done with the cross-correlation function (ccf).
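To build intuition, here is a toy illustration with base R’s ccf(), using made-up series where one is just a delayed, noisy copy of the other:

set.seed(0)
x <- as.numeric(arima.sim(list(ar=0.7), n=200))
# y is x delayed by two steps, plus a little noise
y <- c(rep(NA, 2), head(x, -2)) + rnorm(200, sd=0.1)
# the spike in the resulting plot lands at a nonzero lag,
# showing that one series leads the other
ccf(x, y, na.action=na.omit)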
In order to compute the ccf, we need to get the data into a time series format. While R has extensive time series formats, most notably the built-in ts and its extension xts by Jeff Ryan, we are going to use tsibble, from Rob Hyndman and team, which are tibbles with a time component.

The tsibble object can treat multiple time series as individual data, and this requires the data to be in long format. For this analysis, we want to treat multiple time series as interconnected, so we need to put the data into wide format using pivot_wider().
wide_temps <- all_temps %>%
# we only need these columns
select(Sensor, time, temperature) %>%
# make each time series its own column
tidyr::pivot_wider(
id_cols=time,
names_from=Sensor,
values_from=temperature
) %>%
# convert into a tsibble
tsibble::as_tsibble(index=time) %>%
# fill down any NAs
  tidyr::fill(dplyr::everything())
wide_temps
## # A tibble: 15,840 x 12
## time `Bedroom 2` `Bedroom 1` `Office 1` `Bedroom 3` `Office 3`
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-01-01 00:00:00 67.8 72.4 NA 63.1 NA
## 2 2021-01-01 00:05:00 67.8 72.8 NA 62.7 NA
## 3 2021-01-01 00:10:00 67.9 73.3 NA 62.2 NA
## 4 2021-01-01 00:15:00 67.8 73.6 NA 62.2 NA
## 5 2021-01-01 00:20:00 67.7 73.2 NA 62.3 NA
## 6 2021-01-01 00:25:00 67.6 72.7 NA 62.5 NA
## 7 2021-01-01 00:30:00 67.6 72.3 NA 62.7 NA
## 8 2021-01-01 00:35:00 67.6 72.5 NA 62.5 NA
## 9 2021-01-01 00:40:00 67.6 72.7 NA 62.4 NA
## 10 2021-01-01 00:45:00 67.6 72.7 NA 61.9 NA
## # … with 15,830 more rows, and 6 more variables: `Office 2` <dbl>, `Living
## # Room` <dbl>, Playroom <dbl>, Kitchen <dbl>, `Dining Room` <dbl>,
## # Outdoors <dbl>
Some rooms’ sensors were added after the others, so there are a number of NAs at the beginning of the tracked time.
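A quick way to count the missing readings per sensor column (a minimal sketch):

wide_temps %>%
  as_tibble() %>%
  # number of missing readings for each sensor column
  summarize(across(-time, ~sum(is.na(.x))))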
We compute the ccf using CCF() from the {feasts} package, then generate a plot by piping that result into autoplot(). Besides a tsibble, CCF() needs two arguments: the columns whose cross-correlation is being computed.
library(feasts)
wide_temps %>%
CCF(`Living Room`, `Bedroom 2`) %>%
autoplot()
The negative lags are when the first column is a leading indicator of the second column, and the positive lags are when the first column is a lagging indicator of the second column. In this case, Living Room seems to follow Bedroom 2 rather than lead.
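Rather than reading the peak off the plot, we can also pull it out directly. The CCF() result stores the correlations in a ccf column (used again below), so a small sketch:

wide_temps %>%
  CCF(`Living Room`, `Bedroom 2`) %>%
  as_tibble() %>%
  # keep the row with the largest cross-correlation, along with its lag
  slice_max(ccf, n=1)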
We want to compute the ccf between Bedroom 2 and every other room. We call CCF() on each column in wide_temps against Bedroom 2, including Bedroom 2 itself, by iterating over the columns of wide_temps with map(). We use a ~ to treat CCF() as a quoted function and wrap .x in double curly braces to take advantage of non-standard evaluation. This creates a list of objects that can be plotted with autoplot(). We make a list of ggplot objects by iterating over the ccf objects with imap(). This allows us to use the name of each element of the list in ggtitle(). Lastly, wrap_plots() from {patchwork} is used to show all the plots in one display.
library(patchwork)
library(ggplot2)
room_ccf <- wide_temps %>%
purrr::map(~CCF(wide_temps, {{.x}}, `Bedroom 2`))
# we don't want the CCF with time, outdoor temp or Bedroom 2 itself
room_ccf$time <- NULL
room_ccf$Outdoors <- NULL
room_ccf$`Bedroom 2` <- NULL
biggest_cor <- room_ccf %>% purrr::map_dbl(~max(.x$ccf)) %>% max()
room_ccf_plots <- purrr::imap(
room_ccf,
~ autoplot(.x) +
# show the maximum cross-correlation value
# so it is easier to compare rooms
geom_hline(yintercept=biggest_cor, color='red', linetype=2) +
ggtitle(.y) +
labs(x=NULL, y=NULL) +
ggthemes::theme_few()
)
wrap_plots(room_ccf_plots, ncol=3)
This brings up an odd result. Office 1, Office 2, Office 3, Bedroom 1, Bedroom 2 and Bedroom 3 are all controlled by the same thermostat, but Bedroom 2 is only really cross-correlated with Bedroom 3 (and negatively with Office 3). Instead, Bedroom 2 is most correlated with Playroom and Living Room, both of which are controlled by a different thermostat.
So it will be interesting to see which thermostat setting (the set desired temperature) is most cross-correlated with Bedroom 2. At the same time we’ll check how much the outdoor temperature cross-correlates with Bedroom 2 as well.
temp_settings <- all_temps %>%
# there is a column for outdoor temperatures so we don't need these rows
filter(Name != 'Outdoors') %>%
select(Name, time, HVACmode, zoneCoolTemp, zoneHeatTemp) %>%
# this information is repeated for each sensor, so we only need it once
distinct() %>%
# depending on the setting, the set temperature is in two columns
mutate(setTemp=if_else(HVACmode == 'heat', zoneHeatTemp, zoneCoolTemp)) %>%
tidyr::pivot_wider(id_cols=time, names_from=Name, values_from=setTemp) %>%
# get the reading data
right_join(wide_temps %>% select(time, `Bedroom 2`, Outdoors), by='time') %>%
as_tsibble(index=time)
temp_settings
## # A tibble: 15,840 x 5
## time Upstairs Downstairs `Bedroom 2` Outdoors
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2021-01-01 00:00:00 71 72 67.8 31.3
## 2 2021-01-01 00:05:00 71 72 67.8 31.3
## 3 2021-01-01 00:10:00 71 72 67.9 31.3
## 4 2021-01-01 00:15:00 71 66 67.8 31.3
## 5 2021-01-01 00:20:00 71 66 67.7 31.3
## 6 2021-01-01 00:25:00 71 66 67.6 31.3
## 7 2021-01-01 00:30:00 71 66 67.6 31.2
## 8 2021-01-01 00:35:00 71 66 67.6 31.2
## 9 2021-01-01 00:40:00 71 66 67.6 31.2
## 10 2021-01-01 00:45:00 71 66 67.6 31.2
## # … with 15,830 more rows
Making this tsibble longer allows us to use the built-in grouping and iteration that comes with CCF() and other functions in {fable}.
temp_settings_long <- temp_settings %>%
tidyr::pivot_longer(
cols=-c(time, `Bedroom 2`),
names_to='Thermostat', values_to='Temperature'
)
room_thermostat_ccf <- temp_settings_long %>%
CCF(Temperature, `Bedroom 2`)
room_thermostat_ccf %>% autoplot() + ggthemes::theme_few()
This plot makes a strong argument for Bedroom 2 being more affected by the downstairs thermostat than the upstairs one, contrary to what we presumed. But I don’t think the downstairs thermostat is actually controlling the vents in Bedroom 2, because I have looked at the ductwork and followed the path (along with a professional HVAC technician) from the furnace to the vent in Bedroom 2. What I think is more likely is that the rooms downstairs get so cold (I lower the temperature overnight to conserve energy), and there is not much insulation between the floors, so the vent in Bedroom 2 can’t pump enough warm air to compensate.
I did try an experiment for a few nights (February 12 through 18) of not lowering the downstairs temperature, but the eyeball test didn’t reveal any impact. Perhaps a proper A/B test is called for.
Fitting Time Series Models with {fable}
Going beyond cross-correlations, an ARIMA model with exogenous variables can indicate whether input variables have a significant impact on a time series while accounting for autocorrelation and seasonality. We want to model the temperature in Bedroom 2 using all the other rooms, the outside temperature and the thermostat settings to see which affected Bedroom 2 the most. We fit a number of different models, then compare them using AICc (AIC corrected, which penalizes AIC for small samples) to see which fits best.

To do this we need a tsibble with each of these variables as a column. This requires us to join temp_settings with wide_temps.
all_ts <- wide_temps %>%
left_join(
temp_settings %>% select(time, Upstairs, Downstairs),
by='time'
)
Now we use model() from {fable} (technically from {fabletools}) to fit a number of models. We make it run in parallel with plan() from {future}.
library(future)
plan(multisession, workers=10)
library(tictoc)
library(fable)
tic()
ts_mods <- all_ts %>%
model(
mod0=ARIMA(`Bedroom 2`)
, mod1=ARIMA(`Bedroom 2` ~ `Bedroom 1`)
, mod2=ARIMA(`Bedroom 2` ~ `Bedroom 3`)
, mod3=ARIMA(`Bedroom 2` ~ `Bedroom 1` + `Bedroom 3`)
, mod4=ARIMA(`Bedroom 2` ~ `Living Room`)
, mod5=ARIMA(`Bedroom 2` ~ Playroom)
, mod6=ARIMA(`Bedroom 2` ~ `Living Room` + Playroom)
, mod7=ARIMA(`Bedroom 2` ~ `Living Room` + Playroom +
Kitchen + `Dining Room`
)
, mod8=ARIMA(`Bedroom 2` ~ `Bedroom 1` + `Bedroom 3` +
`Living Room` + Playroom +
Kitchen + `Dining Room`
)
, mod9=ARIMA(`Bedroom 2` ~ Outdoors)
, mod10=ARIMA(`Bedroom 2` ~ Upstairs)
, mod11=ARIMA(`Bedroom 2` ~ Downstairs)
, mod12=ARIMA(`Bedroom 2` ~ Downstairs + Upstairs)
, mod13=ARIMA(`Bedroom 2` ~ Outdoors + Downstairs + Upstairs)
, mod14=ARIMA(`Bedroom 2` ~ Outdoors +
Downstairs + Upstairs +
`Living Room` + Playroom +
Kitchen + `Dining Room`
)
, mod15=ARIMA(`Bedroom 2` ~ Outdoors +
Downstairs + Upstairs +
`Bedroom 1` + `Bedroom 3`
)
, mod16=ARIMA(`Bedroom 2` ~ Outdoors +
Downstairs + Upstairs +
`Bedroom 1` + `Bedroom 3` +
`Living Room` + Playroom +
Kitchen + `Dining Room`
)
)
toc()
## 99.41 sec elapsed
This results in a tibble of models (termed a mable) with one column per model. We use glance() to see the AICc for each, along with other information.
ts_mods %>%
glance() %>%
arrange(AICc)
## # A tibble: 17 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 mod3 0.0247 6723. -13427. -13427. -13358. <cpl [3]> <cpl [14]>
## 2 mod1 0.0247 6714. -13411. -13411. -13350. <cpl [3]> <cpl [14]>
## 3 mod15 0.0247 6709. -13394. -13394. -13302. <cpl [3]> <cpl [3]>
## 4 mod4 0.0257 6393. -12768. -12768. -12699. <cpl [3]> <cpl [25]>
## 5 mod2 0.0264 6199. -12381. -12381. -12320. <cpl [2]> <cpl [15]>
## 6 mod9 0.0264 6190. -12365. -12365. -12304. <cpl [2]> <cpl [26]>
## 7 mod10 0.0264 6190. -12365. -12365. -12303. <cpl [2]> <cpl [26]>
## 8 mod0 0.0264 6189. -12365. -12365. -12311. <cpl [2]> <cpl [26]>
## 9 mod13 0.0264 6176. -12331. -12331. -12255. <cpl [2]> <cpl [26]>
## 10 mod12 0.0264 6175. -12331. -12331. -12262. <cpl [2]> <cpl [26]>
## 11 mod11 0.0264 6174. -12331. -12331. -12270. <cpl [2]> <cpl [26]>
## 12 mod6 0.0195 4940. -9863. -9863. -9802. <cpl [5]> <cpl [0]>
## 13 mod5 0.0195 4932. -9849. -9849. -9796. <cpl [5]> <cpl [0]>
## 14 mod8 0.0163 4603. -9184. -9184. -9100. <cpl [4]> <cpl [0]>
## 15 mod16 0.0163 4605. -9181. -9181. -9074. <cpl [4]> <cpl [0]>
## 16 mod7 0.0171 4364. -8708. -8708. -8631. <cpl [5]> <cpl [0]>
## 17 mod14 0.0171 4365. -8704. -8704. -8605. <cpl [5]> <cpl [0]>
ts_mods %>%
glance() %>%
# sort .model according to AICc for better plotting
mutate(.model=forcats::fct_reorder(.model, .x=AICc, .fun=sort)) %>%
# smaller is better for AICc
# so negate it so that the tallest bar is best
ggplot(aes(x=.model, y=-AICc)) +
geom_col() +
theme(axis.text.x=element_text(angle=25))
The best two models (as judged by AICc) are mod3 and mod1, some of the simpler models. The key thing they have in common is that Bedroom 1 is a predictor, which somewhat contradicts the findings from the cross-correlation function. We can examine mod3 by selecting it out of ts_mods then calling report().
ts_mods %>%
select(mod3) %>%
report()
## Series: Bedroom 2
## Model: LM w/ ARIMA(3,1,2)(0,0,1)[12] errors
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sma1 `Bedroom 1`
## 0.4497 -0.5184 0.0446 0.0650 0.3160 -0.0049 0.2723
## s.e. 0.0593 0.0471 0.0265 0.0589 0.0274 0.0082 0.0082
## `Bedroom 3`
## 0.0396
## s.e. 0.0092
##
## sigma^2 estimated as 0.02469: log likelihood=6722.64
## AIC=-13427.28 AICc=-13427.27 BIC=-13358.25
This shows that Bedroom 1 and Bedroom 3 are significant predictors, with ARIMA(3,1,2)(0,0,1)[12] errors.
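To see the numbers behind that claim, we can pull the coefficient table in tidy form, since coef() on a mable returns one row per term (as used later with {coefplot}), and compute a rough z-statistic:

ts_mods %>%
  select(mod3) %>%
  coef() %>%
  # estimate divided by standard error is roughly a z-statistic
  mutate(z=estimate/std.error)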
The third best model is one of the more complicated models, mod15. Not all the predictors are significant, though we should not judge a model, or the predictors, based on that.
ts_mods %>%
select(mod15) %>%
report()
## Series: Bedroom 2
## Model: LM w/ ARIMA(3,1,3) errors
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 Outdoors Downstairs
## 0.1870 -0.4426 -0.0724 0.3288 0.3745 0.0886 -0.0037 -0.0015
## s.e. 0.2351 0.0928 0.1058 0.2352 0.0490 0.0692 0.0026 0.0025
## Upstairs `Bedroom 1` `Bedroom 3`
## 0.0087 0.2718 0.0387
## s.e. 0.0066 0.0081 0.0092
##
## sigma^2 estimated as 0.02468: log likelihood=6709.22
## AIC=-13394.43 AICc=-13394.41 BIC=-13302.39
We can visualize this model with {coefplot} after stripping out the time series coefficients.
library(coefplot)
ts_mods %>%
select(mod15) %>%
coef() %>%
filter(!stringr::str_detect(term, '^ar|ma\\d+$')) %>%
select(Coefficient=term, Value=estimate, SE=std.error, Model=.model) %>%
mutate(
LowOuter=Value - 2*SE, LowInner=Value - 1*SE,
HighInner=Value + 1*SE, HighOuter=Value + 2*SE
) %>%
arrange(Value) %>%
mutate(Coefficient=factor(Coefficient, levels=Coefficient)) %>%
coefplot()
What all these top performing models have in common is the inclusion of the other bedrooms as predictors.
While helping me interpret the data, my wife pointed out that all the colder rooms are on one side of the house while the warmer rooms are on the other side. That colder side is the one that gets hit by winds which are rather strong. This made me remember a long conversation I had with a very friendly and knowledgeable sales person at GasTec who said that wind can significantly impact a house’s heat due to it blowing away the thermal envelope. I wanted to test this idea, but while the Ecobee API returns data on wind, the value is always 0, so it doesn’t tell me anything. Hopefully, I can get that fixed.
Running Everything with {targets}
There were many moving parts to the analysis, so I managed it all with {targets}. Each step seen in this post is a separate target in the workflow. There is even a target to build an HTML report from an R Markdown file. The key is to load objects created by the workflow into the R Markdown document with tar_read() or tar_load(). Then tar_render() will figure out how the R Markdown document fits into the computational graph.
tar_render(
results,
'Analysis.Rmd'
)
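In context, that target sits inside _targets.R alongside the data steps. A minimal sketch, where get_raw_temps() and prepare_temps() are hypothetical stand-ins for the code in this post:

library(targets)
library(tarchetypes)
# packages each target needs when it runs
tar_option_set(packages=c('arrow', 'dplyr', 'tidyr', 'tsibble', 'feasts', 'fable'))
list(
  tar_target(temps_raw, get_raw_temps()),
  tar_target(temps, prepare_temps(temps_raw)),
  # the report loads these objects with tar_read() or tar_load()
  tar_render(results, 'Analysis.Rmd')
)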
I went over {targets} more thoroughly in the previous post, and I recommend Will Landau’s talk at the New York Open Statistical Programming Meetup to learn more.
Reporting with RStudio Connect
In addition to the R Markdown report, I made a Shiny dashboard using the R Markdown flexdashboard format. I host this on RStudio Connect so I can get a live view of the data and examine the models. Beyond being a partner with RStudio, I really love this product and use it for all of my internal reporting (and side projects), and we have a whole practice around getting companies up to speed with the tool.
What Comes Next?
So what can be done with this information? Based on the ARIMA model, the other bedrooms on the same floor are predictive of Bedroom 2, which makes sense since they are controlled by the same thermostat. Bedroom 3 is consistently one of the coldest rooms in the house and shares a wall with Bedroom 2, so maybe that wall could use some insulation. Bedroom 1, on the other hand, is the hottest room on that floor, so my idea is to find a way to cool off that room (we sealed the vents with magnetic covers), thus lowering the average temperature on the floor, causing the heat to kick in and raising the temperature everywhere, including Bedroom 2.
The cross-correlation functions also suggest a relationship with Playroom and Living Room, which are some of the colder rooms downstairs; the former is directly underneath Bedroom 2. I’ve already put caulking cord in the gaps around the floorboards in the Playroom, and that has noticeably raised the temperature in there, so hopefully that will impact Bedroom 2. We already have storm windows on just about every window and have added thermal curtains to Bedroom 2, Bedroom 3 and the Living Room. My wife and I also discovered air billowing in from around, not in, the fireplace in the Living Room, so if we can close off that source of cold air, that room should get warmer and then, hopefully, Bedroom 2 as well. Lastly, we want to insulate between the first-floor ceiling and the second floor.
While we make these fixes, we’ll keep collecting data and trying new ways to analyze it all. It will be interesting to see how the data change in the summer.
These three posts were intended to show the entire process from data collection to automation (what some people are calling robots) to analysis, and show how data can be used to make actionable decisions. This is one more way I’m improving my life with data.
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming Meetup and the New York and Washington DC R Conferences, and author of R for Everyone.