Getting People Started
A large part of my work is teaching R: for private clients, at Columbia Business School, at conferences, and in public workshops that I facilitate for others.
A common theme is that getting everyone set up on their individual computers is very difficult. No matter how many instructions I provide, there are always a good number of people without a proper environment. This can mean not using RStudio projects, not having the right packages installed, not downloading the data, and sometimes not even installing R.
Solution
After many experiments I finally came upon a solution. For every class I teach I now create a skeleton project hosted on GitHub with instructions for setup.
The instructions (in the README) consist of three blocks of code.
- Package installation
- Copying the project structure from the repo (no git required)
- Downloading data
All the user has to do is copy and paste these three blocks of code into the R console and they have the exact same environment as the instructor and other students.
packages <- c(
    'coefplot',
    'rprojroot',
    'tidyverse',
    'usethis'
)
install.packages(packages)

newProject <- usethis::use_course('https://github.com/jaredlander/WorkshopExampleRepo/archive/master.zip')

source('prep/DownloadData.r')
Using this process, 95% of my students are prepared for class.
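For the remaining students, a quick check of which packages are still missing can narrow down the problem. The snippet below is only a sketch of one way to do that; it is not part of the generated README, and the variable names `installed` and `missingPackages` are my own.

```r
# packages the class expects, matching the installation block in the README
packages <- c('coefplot', 'rprojroot', 'tidyverse', 'usethis')

# requireNamespace() returns FALSE, rather than stopping with an error,
# when a package cannot be loaded
installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)

# names of anything still missing; character(0) means everything is installed
missingPackages <- packages[!installed]
missingPackages
```

Anything listed in `missingPackages` can be passed straight back to `install.packages()`.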
The inspiration for this idea came from a fun coffee with Hadley Wickham and Jenny Bryan during a conference in New Zealand, and the implementation is made possible thanks to the usethis package.
Automating the Setup
Now that I had found a good way to get students started, I wanted to make it easier for me to set up the repo. So I created an R package called RepoGenerator and put it on CRAN.
The first step to using the package is to create a GitHub Personal Access Token (instructions are in the README). Then you build a data.frame listing the datasets you want the students to download. The data.frame needs at least the following three columns.
- Local: The name, not path, the file should have on disk
- Remote: The URL where the data file is stored online
- Mode: The mode needed to write the file to disk, 'w' for regular text files, 'wb' for binary files such as Excel or rds files
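As a rough illustration, a data.frame with these three columns could be built by hand as follows. The file names and the example.com URLs are placeholders I made up for this sketch, not real data sources.

```r
# hypothetical listing of two files: one plain text, one binary
myDatafiles <- data.frame(
    # name each file should have on disk (no directory component)
    Local = c('Sales.csv', 'Budget.xlsx'),
    # placeholder URLs; replace with wherever the files are actually hosted
    Remote = c('https://example.com/Sales.csv', 'https://example.com/Budget.xlsx'),
    # 'w' for the text file, 'wb' for the binary Excel file
    Mode = c('w', 'wb'),
    stringsAsFactors = FALSE
)

myDatafiles
```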
An example data.frame is available in the RepoGenerator package.
data(datafiles, package='RepoGenerator')
datafiles[1:6, c('Local', 'Remote', 'Mode')]
Local | Remote | Mode |
---|---|---|
DiamondColors.csv | https://query.data.world/s/uVlTdijkCbfac49-3k12tawsmviArp | w |
diamonds.db | https://query.data.world/s/Z5k9W39e1kD5hzcJIcRlFClhIHnw5v | wb |
ExcelExample.xlsx | https://query.data.world/s/5wa6K_X91yfkf-BVpRe2UIabO5A-QB | wb |
FavoriteSpots.json | https://query.data.world/s/033kPeDH9pMdcnhPRIOwhjrw3lpA10 | w |
flightPaths.csv | https://query.data.world/s/IIwWxfh9cTydB8h_OueRyA7yxvZ6bf | w |
reaction.txt | https://query.data.world/s/uDfiLMRxSiB_kQQhEt_LbDGVOcStBR | w |
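To show how the three columns fit together, here is a minimal sketch of a download loop. It is not the code RepoGenerator writes into the repo; `downloadDataFiles()` and its `dir` argument are hypothetical names of mine. The only real API it relies on is `download.file()`, whose `mode` argument accepts the same 'w' and 'wb' values stored in the Mode column.

```r
# Hypothetical helper, not part of RepoGenerator: download every file listed
# in a data.frame that has Local, Remote and Mode columns.
downloadDataFiles <- function(files, dir = 'data') {
    # create the target folder if it does not exist yet
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)

    # Local holds bare file names, so prepend the folder to get full paths
    dest <- file.path(dir, as.character(files$Local))

    # one download per row, passing that row's Mode to download.file()
    for (i in seq_len(nrow(files))) {
        utils::download.file(
            url = as.character(files$Remote[i]),
            destfile = dest[i],
            mode = as.character(files$Mode[i]),
            quiet = TRUE
        )
    }

    # return the paths that were written, without printing them
    invisible(dest)
}
```

Calling `downloadDataFiles(datafiles)` would then fetch every listed file into a `data` folder, provided the URLs are still reachable.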
After that you define the packages you want your students to use. There can be as few or as many as you want. In addition to any packages you list, rprojroot and usethis are added so that the instructions in the new repo will be certain to work.
packages <- c('caret', 'coefplot', 'DBI', 'dbplyr', 'doParallel', 'dygraphs',
              'foreach', 'ggthemes', 'glmnet', 'jsonlite', 'leaflet', 'odbc',
              'recipes', 'rmarkdown', 'rprojroot', 'RSQLite', 'rvest',
              'tidyverse', 'threejs', 'usethis', 'UsingR', 'xgboost', 'XML',
              'xml2')
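Because rprojroot and usethis are added automatically, listing them yourself, as above, is harmless. The sketch below shows one way such a merge could be done without producing duplicates. It is an assumption about the general idea, not the package's actual code, and `requested` and `allPackages` are names I chose for illustration.

```r
# a short hypothetical request that already includes one of the required packages
requested <- c('coefplot', 'tidyverse', 'usethis')

# union() appends only the packages that are not already present
allPackages <- union(requested, c('rprojroot', 'usethis'))

allPackages
```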
Now all you need to do is call the createRepo() function.
createRepo(
    # the name to use for the repo and project
    name='WorkshopExampleRepo',
    # the location on disk to build the project
    path='~/WorkshopExampleRepo',
    # the data.frame listing data files for the user to download
    data=datafiles,
    # vector of packages the user should install
    packages=packages,
    # the GitHub username to create the repo for
    user='jaredlander',
    # the new repo's README has the name of who is organizing the class
    organizer='Lander Analytics',
    # the name of the environment variable storing the GitHub Personal Access Token
    token='MyGitHubPATEnvVar'
)
After this you will have a new repo set up for your users to copy, including instructions.
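Since the token argument takes the name of an environment variable rather than the token itself, it can help to confirm that the variable is actually set before calling createRepo(). The check below is a small sketch of that idea. The suggestion to store the token in ~/.Renviron is a common R convention and my own assumption here, not something the package requires.

```r
# the example variable name passed as token= in the createRepo() call
tokenName <- 'MyGitHubPATEnvVar'

# Sys.getenv() returns an empty string when the variable is not defined
hasToken <- nzchar(Sys.getenv(tokenName))

if (!hasToken) {
    message('Set ', tokenName, ' (for example in ~/.Renviron) and restart R before calling createRepo()')
}
```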
That’s All
Reducing setup issues at the start of a training can really improve the experience for everyone and allow you to get straight into teaching.
Please check it out and let me know how it works for you.
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences; and author of R for Everyone.