In a recent post I talked about collecting temperature data from different rooms in my house. Using {targets}, I am able to get temperature readings for any given day down to five-minute increments and store that data on a DigitalOcean space.

Since I want to both fit models on the data and track temperatures close to real time, I needed this to run on a regular basis, say every 30-60 minutes. I have a machine learning server that Kaz Sakamoto, Michael Beigelmacher and I built from parts we bought, and a collection of Raspberry Pis, but I didn’t want to maintain the hardware for this; I wanted something that was handled for me.

Luckily, GitHub Actions can run arbitrary code on a schedule and provides free compute for public repositories. So I built a Docker image and GitHub Actions configuration files to make it all happen. This is all based on the templogging repository on GitHub.
Docker Image
To ensure everything works as expected and to prevent package updates from breaking code, I used a combination of {renv} and Docker. {renv} allows each individual project to have its own set of packages isolated from all other projects’ packages. It also keeps track of package versions so that the project can be restored on other computers, or in a Docker container, in the exact same configuration.
Using {renv} is fairly automatic. When starting a new project, run renv::init() to create the isolated package library and start tracking packages. Packages are then installed as usual with install.packages() or renv::install(). Periodically, renv::snapshot() is used to write the specific package versions to the renv.lock file, which is saved in the root directory of the project.
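In practice the whole cycle, from an interactive R session, looks roughly like this (a minimal sketch; the package names are just the ones that come up later in this post):

# run once when setting up the project: creates the isolated library,
# writes the initial renv.lock and starts tracking packages
renv::init()

# install packages as usual; they go into the project's own library
install.packages('targets')
renv::install('httr')

# record the exact versions currently in use into renv.lock
renv::snapshot()

# on another machine, or in a Docker container, reinstall those exact versions
renv::restore()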
Docker is like a lightweight virtual machine that can create a specific environment. For this project I wanted a specific version of R (4.0.3), a handful of system libraries like libxml2-dev (for {igraph}, which is needed for {targets}) and libcurl4-openssl-dev (for {httr}), and all of the packages tracked with {renv}.
The R version is provided by starting with the corresponding r-ver image from the rocker project. The system libraries are installed using apt install.
The R packages are installed using renv::restore(), but this requires files from the project to be added to the image in the right order. First, the templogging directory is created to hold everything, then the individual files needed for the project and {renv} are added to that directory: templogging.Rproj, .Rprofile and renv.lock. Then the entire renv directory is added as well. After that, the working directory is changed to the templogging directory so that the following commands take place in that location.
Then comes time to install all the packages tracked in renv.lock. This is done with RUN Rscript -e "renv::restore(repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest', confirm=FALSE)". Rscript -e runs R code as if it were executed in an R session, and renv::restore() installs all the specific packages. Setting repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest' specifies prebuilt binaries from the public RStudio Package Manager (some people need the commercial, private version, and if you think this may be you, my company, Lander Analytics, can help you get started).
After the packages are installed, the rest of the files in the project are copied into the image with ADD . /templogging. It is important that this step takes place after the packages are installed so that changes to code do not trigger a reinstall of the packages.
This is the complete Dockerfile.
FROM rocker/r-ver:4.0.3
# system libraries
RUN apt update && \
    apt install -y --no-install-recommends \
    # for igraph
    libxml2-dev \
    libglpk-dev \
    libgmp3-dev \
    # for httr
    libcurl4-openssl-dev \
    libssl-dev && \
    # makes the image smaller
    rm -rf /var/lib/apt/lists/*
# create project directory
RUN mkdir templogging
# add some specific files
ADD ["templogging.Rproj", ".Rprofile", "renv.lock", "/templogging/"]
# add all of the renv folder, except the library which is marked out in .dockerignore
ADD renv/ /templogging/renv
# make sure we are in the project directory so restoring the packages works cleanly
WORKDIR /templogging
# this restores the desired packages (including specific versions)
# it uses the public RStudio package manager to get binary versions of the packages
# this is for faster installation
RUN Rscript -e "renv::restore(repos='https://packagemanager.rstudio.com/all/__linux__/focal/latest', confirm=FALSE)"
# then we add in the rest of the project folder, including all the code
# we do this separately so that we can change code without having to reinstall all the packages
ADD . /templogging
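One detail worth calling out: as the comment in the Dockerfile notes, the local renv library is kept out of the image by .dockerignore, so the packages always come from renv::restore() rather than from copying a machine-specific library. I am not reproducing the repo’s .dockerignore here, but a minimal version only needs to skip the project library (these exact entries are my guess, not copied from the repo):

# keep the local package library out of the Docker build context
renv/library/
.Rproj.user/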
To make building and running the image easier I made a docker-compose.yml file.
version: '3.4'

services:
  templogging:
    image: jaredlander/templogging
    container_name: tempcheck
    build:
      context: .
      dockerfile: ./Dockerfile
    environment:
      - BUCKET_NAME=${BUCKET_NAME}
      - FOLDER_NAME=${FOLDER_NAME}
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION}
      - AWS_S3_ENDPOINT=${AWS_S3_ENDPOINT}
      - AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN}
      - TZ=${TZ}
So we can now build and upload the image with the following shell code.
docker-compose build
docker push jaredlander/templogging:latest
Now the image is available to be used anywhere. Since this image, and all the code, are public, it’s important that nothing personal or private was saved in the code. As can be seen in the docker-compose.yml file, or any of the R code in the repo, all potentially sensitive information is stored in environment variables.
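On the R side that just means reading those values at runtime instead of hard-coding them. A rough sketch of the pattern (the actual variable handling in the repo may differ slightly):

# settings specific to this project are read from the environment
bucket_name <- Sys.getenv('BUCKET_NAME')
folder_name <- Sys.getenv('FOLDER_NAME')

# fail early if something was not provided
stopifnot(nzchar(bucket_name), nzchar(folder_name))

# the AWS_* variables (access key, secret, region, S3 endpoint) never need to be
# read explicitly; the S3 client packages pick them up from the environment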
The {targets} workflow can be executed inside the Docker container with the following shell command.
docker-compose run templogging R -e "targets::tar_make()"
That will get and write data for the current date. To run this for a specific date, a variable called process_date should be set to that date (this assumes the TZ environment variable has been set) and tar_make() should be called with the callr_function argument set to NULL. I’m sure there is a better way to make tar_make() aware of runtime settings, but I haven’t figured it out yet.
docker-compose run templogging R -e "process_date <- as.Date('2021-02-20')" -e "targets::tar_make(callr_function=NULL)"
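The reason callr_function=NULL matters is that tar_make() normally runs the pipeline in a fresh background R process, which would never see a process_date assigned in the current session; with callr_function=NULL the pipeline runs in the current session, where the variable exists. Inside _targets.R the date logic amounts to something like this (a sketch of the idea, not the exact code from the repo):

# near the top of _targets.R: use process_date if the caller assigned one,
# otherwise default to the current date (in the timezone given by TZ)
if(!exists('process_date'))
{
    process_date <- Sys.Date()
}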
GitHub Actions
After the Docker image was built, I automated the whole process using GitHub Actions. This requires a yml file inside the .github/workflows folder in the repository.
The first section is on:. This tells GitHub to run the job whenever there is a push to the main or master branch and whenever there is a pull request to these branches. It also runs every 30 minutes thanks to the schedule block that has - cron: '*/30 * * * *'. The */30 means “every thirty” and it sits in the first position, which is minutes (0-59). The remaining positions are hours (0-23), day of month (1-31), month of year (1-12) and day of week (0-6, with 0 as Sunday); GitHub Actions’ cron syntax has no year field.
The next section, name:, just provides the name of the workflow.
Then comes the jobs: section. Multiple jobs can be run, but here there is only one, called Run-Temp-Logging.
The job runs on (runs-on:) ubuntu-latest, as opposed to Windows or macOS.
This is going to use the Docker container, so the container: block specifies an image: and environment variables (env:). The environment variables are stored securely in GitHub and are accessed via ${{ secrets.VARIABLE_NAME }}. Since creating this I have been debating whether it would be faster to install the packages directly in the virtual machine spun up by the Actions runner rather than use a Docker image. Given that R package installations can be cached it might work out, but I haven’t tested it yet.
The steps: block runs the {targets} workflow. There can be multiple steps, though in this case there is only one, and each step starts with a dash (-). For the step, a name: is given, along with what to run (run: R -e "targets::tar_make()") and the working directory (working-directory: /templogging), each on a separate line in the file. The R -e "targets::tar_make()" command runs in the Docker container with /templogging as the working directory, which has the same result as running docker-compose run templogging R -e "targets::tar_make()" locally. Each time this runs, the file in the DigitalOcean space gets written anew.
The entire yml file is below.
on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  schedule:
    - cron: '*/30 * * * *'

name: Run-Targets

jobs:
  Run-Temp-Logging:
    runs-on: ubuntu-latest
    container:
      image: jaredlander/templogging:latest
      env:
        BUCKET_NAME: ${{ secrets.BUCKET_NAME }}
        FOLDER_NAME: ${{ secrets.FOLDER_NAME }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
        AWS_S3_ENDPOINT: ${{ secrets.AWS_S3_ENDPOINT }}
        AWS_SESSION_TOKEN: ""
        ECOBEE_API_KEY: ${{ secrets.ECOBEE_API_KEY }}
        ECOBEE_REFRESH_TOKEN: ${{ secrets.ECOBEE_REFRESH_TOKEN }}
    steps:
      - name: Run targets workflow
        run: R -e "targets::tar_make()"
        working-directory: /templogging
With this running every 30 minutes I can pull the data just about any time and have close-enough-to-live insight into my house. But let’s be honest, I really only need to see up to the past few days to make sure everything is all right, so a 30-minute delay is more than good enough.
End of Day
While the job is scheduled to run every thirty minutes, I noticed that it wasn’t exact, so I couldn’t be sure I would capture the last few five-minute intervals of each day. So I made another GitHub Action to pull the prior day’s data at 1:10 AM.
The instructions are mostly the same, with just two changes. First, the schedule is - cron: '10 1 * * *', which means the 10th minute of the first hour. Since the rest of the slots are * the job occurs on every one of them, so it runs at the 10th minute of the first hour of every day of every month.
The second change is the command given to the Docker container to run the {targets} workflow. As alluded to earlier, the _targets.R file is designed so that if process_date is assigned a date, the data will be pulled for that date instead of the current date. So the run: portion has the command R -e "Sys.setenv(TZ='${{ secrets.TZ }}')" -e "process_date <- Sys.Date() - 1" -e "targets::tar_make(callr_function=NULL)". This makes sure the timezone environment variable is set in R, process_date gets the prior day’s date and tar_make() is called.
The entire yml file is below.
on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  schedule:
    - cron: '10 1 * * *'

name: Run-Targets

jobs:
  Run-Temp-Logging:
    runs-on: ubuntu-latest
    container:
      image: jaredlander/templogging:latest
      env:
        BUCKET_NAME: ${{ secrets.BUCKET_NAME }}
        FOLDER_NAME: ${{ secrets.FOLDER_NAME }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
        AWS_S3_ENDPOINT: ${{ secrets.AWS_S3_ENDPOINT }}
        AWS_SESSION_TOKEN: ""
        ECOBEE_API_KEY: ${{ secrets.ECOBEE_API_KEY }}
        ECOBEE_REFRESH_TOKEN: ${{ secrets.ECOBEE_REFRESH_TOKEN }}
    steps:
      - name: Run targets workflow
        run: R -e "Sys.setenv(TZ='${{ secrets.TZ }}')" -e "process_date <- Sys.Date() - 1" -e "targets::tar_make(callr_function=NULL)"
        working-directory: /templogging
What’s Next?
Now that the data are being collected, it’s time to analyze them and see what is happening in the house. That is very involved, so it will be the subject of the next post.
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, Adjunct Professor at Columbia University, organizer of the New York Open Statistical Programming Meetup and the New York and Washington DC R Conferences, and author of R for Everyone.