Dobby: A simple hack for on-demand remote execution

Motivation: running long jobs on your laptop sucks

At Flatiron, we run a lot of data pipelines. Early on, we were processing fairly small amounts of data, so investing a lot in tooling didn’t seem worth it. But around a year ago, we signed a contract that increased the number of patients we were processing by several orders of magnitude.

All of a sudden, some of our pipelines started to take several hours. This made development hard because the only way to test a change, say adding a new data element or refactoring some code, was to run the pipeline locally on the workstation that we were developing on.

This was far from ideal. Our laptops could become unusable while the pipeline was running; losing an internet connection for a second in the elevator could set us back hours; and there was no way to kick off a job before we left work in the evening. In short, iteration was slow and painful.

So, as a side project at Flatiron, I started talking to some engineers on my team about how we could run our Python jobs remotely during development.

It’s not that we weren’t already running jobs remotely. We did this all the time. The difference was that we were only running production jobs remotely – jobs for which someone was paged if there was a problem. We wanted a lighter-weight solution that would allow us to do this during development – when we had no idea whether a job would succeed or fail.

And we weren’t happy with the default solution of configuring a remote machine manually to match our development environment and using screen. This process had too many manual steps and did not give us any status notifications – we would still have to babysit our jobs to see if they succeeded or failed.

We wanted a simple solution that had the following characteristics:

  1. the ability to launch a remote job from our development machines with one command,
  2. the ability to execute code on an experimental branch remotely to get immediate feedback on changes during development, and
  3. notifications when the job started, failed, or finished successfully.

In other words, we wanted to be able to type something from our terminal like

launch_job my-git-branch "arbitrary command"

and have that launch the job remotely but keep us as up to date as if we were running it locally.

We could have used a framework like Aurora, Marathon, or Kubernetes, but these tools are heavier than we needed. First, we already had a production deployment solution that worked for us, so it didn’t seem worth it to invest in building and maintaining a Mesos or Kubernetes cluster just for executing code during development. Second, we were a relatively small team with jobs that, while slow, were usually harder on databases than on the processes running them. Thus we weren’t too concerned about the resource requirements of our jobs, and we didn’t want to be forced to think about this every time we launched a job. Third, the kind of alerting we wanted, while possible, didn’t come out of the box with these tools.

So, we decided to build our own lightweight solution to this problem, tailored specifically to our needs.

As in any software project, naming is key. Before we got too far into designing and building this system, I talked to my wife to brainstorm a name. She suggested “Dobby”, the house elf from Harry Potter. We liked it in part because, like a house elf, this system did whatever you asked it to. But mostly we just thought that Dobby was cute.

Dobby to the rescue

With the most important step – naming – behind us, we were ready to design the system itself. What we came up with was quite simple, basically a hack.

Server side implementation

On the server side, we wrote a shell script called start.sh (installed by Chef) that took as parameters:

  • a unique identifier for the job supplied by the user
  • a branch of our git repository
  • a subdirectory of our git repository to run the command in
  • a set of emails to send job status notifications to
  • an arbitrary command to run

The first thing start.sh does is (1) set a trap so that the user is notified if anything fails and (2) inform the user that the job has started. The email.sh and slack.sh files referenced in the code below define simple functions to send out custom notifications via email and Slack.

source "$DIR/email.sh"
source "$DIR/slack.sh"

function send_setup_error_msg {
 email "Job '$LABEL' failed." "$CMDS" "$BRANCH" "$REL_DIR" "$LABEL" "$TIMESTAMP" "$EMAILS"
 slack "Job *$LABEL* failed." "$CMDS" "$BRANCH" "$REL_DIR" "$LABEL" "$TIMESTAMP" danger
}
trap send_setup_error_msg EXIT

email "Job '$LABEL' started." "$CMDS" "$BRANCH" "$REL_DIR" "$LABEL" "$TIMESTAMP" "$EMAILS"
slack "Job *$LABEL* started by *$FH_USER*." "$CMDS" "$BRANCH" "$REL_DIR" "$LABEL" "$TIMESTAMP" "#cc0"
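
The same start/fail/finish notification pattern can be sketched in Python as a context manager. This is only an illustration under assumptions: start.sh itself uses a bash trap, and the `notify` callable here is a hypothetical stand-in for the email and Slack helpers.

```python
from contextlib import contextmanager


@contextmanager
def notified(label, notify):
    """Report the start of a job, then exactly one of 'failed' or 'finished'.

    `notify` is a hypothetical callable standing in for email/Slack senders.
    """
    notify("Job '%s' started." % label)
    try:
        yield
    except BaseException:
        # Plays the role of the shell trap: any failure is reported,
        # and the error still propagates to the caller.
        notify("Job '%s' failed." % label)
        raise
    else:
        notify("Job '%s' finished." % label)
```

Wrapping the body of a job in `with notified("my-job", send):` means that no exit path can skip the final notification, which is the property the trap is meant to provide.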

After this, the script creates a new directory based on the job identifier. This directory acts as a sandbox that shields the job from any other jobs that might be running on the Dobby server concurrently.

mkdir -p "$JOB_DIR"

Then, the script clones our git repository into that directory. Initially, since we have some large submodules, the git clone took about 15 minutes and was dominating the startup time for jobs. But then we discovered the concept of a local reference repository, which acts as a cache for all of the objects in the repository. Using this technique reduced the time of our clone to just over a minute, thus making it much lighter weight to launch jobs. Here is the code to clone the git repository using a reference cache and check out the remote branch:

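# Borrow objects from the local reference cache instead of downloading them.
git clone --reference "$GIT_REFERENCE_CACHE_DIR" "$REMOTE_GIT_REPO" "$REPO_DIR"
pushd "$REPO_DIR"
git fetch
# Detach HEAD so that any stale local copy of the branch can be deleted.
git checkout --detach
git branch -D "$BRANCH" 2> /dev/null || true
# Recreate the branch from the remote and update submodules to match.
git checkout -f "$BRANCH"
git submodule init
git submodule sync --recursive
git submodule update --recursive -f --reference "$GIT_REFERENCE_CACHE_DIR"
popd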
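
For readers scripting this from Python, the core of the reference-cache technique can be sketched as a function that only builds the git command lines. The function name and all four arguments are hypothetical. The `--mirror` step is an assumption about how such a cache might be created, since the post does not show that part.

```python
def reference_clone_commands(remote, cache_dir, repo_dir, branch):
    """Return git command lines for a clone that reuses a local object cache.

    All four arguments are placeholder paths/names supplied by the caller.
    """
    return [
        # One-time setup (an assumption, not shown in the post): a bare mirror
        # of the remote serves as the reference cache.
        ["git", "clone", "--mirror", remote, cache_dir],
        # --reference borrows objects already in cache_dir, so only objects
        # missing from the cache are transferred from the remote.
        ["git", "clone", "--reference", cache_dir, remote, repo_dir],
        ["git", "-C", repo_dir, "checkout", "-f", branch],
    ]
```

Each list can be passed to `subprocess.check_call`. The first command only needs to run once per machine, while the other two run for every job.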

Next, start.sh sets up a Python virtualenv and uses pip to install all of our third-party packages. This is also quite fast because of caching. This code is specific to our environment, but it could be replaced with whatever is needed to rebuild the development environment. The important thing for us is that the virtualenv isolates the environment within the job directory so that it can be different for each job. Of course, a tool like Docker could also help with this.

Having set up the environment, start.sh next runs the actual command.

echo " COMMAND: $cmd"
echo " START TIME: $(date)"
eval "$cmd"
ret=$?
if [[ $ret -ne 0 ]]; then
 email "Job '$LABEL' failed :(" "$cmd" "$BRANCH" "$REL_DIR" "$LABEL" "$TIMESTAMP" "$EMAILS"
 slack "Job *$LABEL* failed :(" "$cmd" "$BRANCH" "$REL_DIR" "$LABEL" "$TIMESTAMP" danger
 exit $ret
fi

The standard output and standard error from start.sh are written to a known location based on the job identifier and timestamp. Every notification sent by Dobby includes a single command to download and view these log files. Thus an engineer can always figure out what went right or wrong.
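
As a rough illustration of this logging convention, the sketch below derives a log path from the job label and timestamp and sends both output streams to it. The file naming scheme and both function names are assumptions made for the example, not Dobby's actual layout.

```python
import os
import subprocess


def log_path(log_root, label, timestamp):
    """Build a predictable log file path from the job label and start time."""
    return os.path.join(log_root, "%s-%s.log" % (label, timestamp))


def run_logged(argv, path):
    """Run argv with stdout and stderr both appended to the file at path.

    Returns the exit code of the command.
    """
    with open(path, "ab") as log:
        return subprocess.call(argv, stdout=log, stderr=subprocess.STDOUT)
```

Because the path is a pure function of the label and timestamp, a notification can point at the log file without any extra bookkeeping.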

Client side implementation

The start.sh script took care of our second and third requirements, but it didn’t address the first: the ability to launch a job remotely with a single command on an engineer’s development machine. For that, we used Fabric, a simple Python library for scripting remote execution over ssh, and dtach, a small utility that emulates the part of screen that allows jobs to keep running even after the ssh session terminates. A Fabric command called dobby.start calls start.sh (wrapped by dtach) and forwards the parameters supplied by the user. Here is the Fabric command for starting a job:

import os
import sys

from fabric.api import env, sudo
from fabric.contrib.files import exists

# _SOCKET_DIR and _SCRIPT_DIR are assumed to be module-level constants defined elsewhere.

def start(job_label, branch, cmd, dir='.'):
    """Starts a dobby job, as long as it is not already running."""
    users = env.users_to_notify.split(':')
    emails = [user + '@flatiron.com' for user in users]
    email_str = ','.join(emails)

    # The dtach utility maintains a socket for each running job. Use this as a lock on job name.
    socket_path = os.path.join(_SOCKET_DIR, 'socket-' + job_label)
    if exists(socket_path, use_sudo=True):
        print >> sys.stderr, '\nERROR: Job already running.'
        sys.exit(1)

    prefix = 'dtach -n ' + socket_path  # Create socket but don't attach to it.
    cmd = prefix + (' bash {script_dir}/start.sh {user} {job_label} {branch} {email_str} {dir} ' +
                    '{cmd}').format(
                        user=env.user, script_dir=_SCRIPT_DIR, job_label=job_label, branch=branch,
                        cmd=cmd, email_str=email_str, dir=dir)

    sudo(cmd, user=env.application_user)

With this code, all the user needs to do to launch a job remotely is run a command that looks like

fab dobby.start:my-job-id,master,'my command'
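
One detail worth care when forwarding an arbitrary command over ssh is shell quoting. The sketch below (Python 3) shows one way to assemble such a command line with `shlex.quote`. It is an assumption-laden illustration: the function name, its signature, and the script path used in the example are hypothetical, and the argument order is simplified relative to the task above.

```python
import shlex


def build_remote_command(script, label, branch, emails, rel_dir, cmd):
    """Join arguments into one shell-safe command line.

    shlex.quote wraps each value so that spaces or shell metacharacters in,
    say, the user's command arrive at the script as a single argument.
    """
    args = [script, label, branch, ",".join(emails), rel_dir, cmd]
    return "bash " + " ".join(shlex.quote(a) for a in args)
```

With this approach, a command such as `python run.py --n 3` reaches the script as one positional parameter rather than four.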

We also wrote Fabric commands for stopping running jobs (using a pid file written by start.sh), restarting jobs (which just stops the job and then starts it), cleaning job directories, and downloading logs. Here is the Fabric code to stop a running job:

def stop(job_label):
    """Stops a dobby job."""
    _check_job_label(job_label)
    cmd = 'bash {script_dir}/stop.sh {job_label}'.format(
        script_dir=_SCRIPT_DIR, job_label=job_label)
    sudo(cmd, user=env.application_user)

where stop.sh looks like

#!/usr/bin/env bash
#
# Script to stop a Dobby job.  This is called by the dobby.stop, dobby.restart, and dobby.clean fabric tasks.

LABEL=$1

DIR=$(cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd)
SOCKET_DIR="$DIR/sockets"
PID_DIR="$DIR/pid_files"

source "$DIR/slack.sh"

if [ ! -f "$PID_DIR/$LABEL" ]; then
 echo "Process id for job $LABEL unknown."
 exit 0
fi

pid=$(cat "$PID_DIR/$LABEL")

if ! kill -0 $pid 2> /dev/null; then
 echo "Job $LABEL already stopped."
 exit 0
fi

# The process ID of the bash shell will be the process group id of all child processes.
# These commands kill all processes in the group: http://stackoverflow.com/questions/392022.
kill -2 -$pid || true  # Try SIGINT signal first.
sleep 3
if kill -0 $pid 2> /dev/null; then
 kill -KILL -$pid || true
fi

rm -f "$SOCKET_DIR/socket-$LABEL"
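
The escalation logic in stop.sh (try SIGINT on the whole process group, wait, then fall back to SIGKILL) can also be expressed in Python 3. This is a minimal sketch with a hypothetical function name, not part of Dobby. Unlike stop.sh, it polls the group rather than sleeping for a fixed three seconds.

```python
import os
import signal
import time


def stop_group(pgid, grace_seconds=3.0):
    """Send SIGINT to a process group, then SIGKILL if it is still alive."""
    try:
        os.killpg(pgid, signal.SIGINT)
    except ProcessLookupError:
        return  # Nothing left to stop.
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.killpg(pgid, 0)  # Signal 0 only checks for existence.
        except ProcessLookupError:
            return
        time.sleep(0.1)
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
```

A process started with `subprocess.Popen(..., start_new_session=True)` leads its own group, so its pid can be passed as `pgid`. If the caller is the parent of that process, it still has to call `wait()` to reap it, because an unreaped child keeps the group visible until then.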

Lessons learned

Nine months after building Dobby, we are using it every day. Dobby is quick enough that we use it for jobs that take 15 minutes, and it is robust enough that we use it for jobs that take 15 hours. It is hard for us to imagine life before Dobby, when we could often be seen walking to meetings with our laptops cracked open to avoid disrupting a job.

[Image: Dobby posting to our Slack channel]

Two lessons in particular stand out:

  1. Keeping it simple was the right choice. In principle, there is nothing stopping a bunch of large jobs running concurrently from taking down the server. But the dozen or so engineers using Dobby have been able to use manual coordination techniques, like Slack, to avoid this. It is easy to start coming up with extra Dobby features that sound good on paper, like job dependencies, auto-rerun failures, or load balancing on multiple machines, but each of those features would’ve added extra complexity to manage, and so far we haven’t needed any of that for our workflows. As a result, surprisingly little engineering time has been devoted to maintaining Dobby, and we can instead focus on delivering high quality data.
  2. Public notifications are really nice. Our initial design had just email notifications. These were helpful for the person who launched the job (who was emailed automatically). But eventually we realized more visible notifications would also be helpful. At Flatiron, we are heavy Slack users, so this seemed like a natural way to keep a wider group of people informed as to what Dobby was doing. We created a Slack integration with a channel dedicated to Dobby, and our scripts post to it as shown in the code above. People can subscribe to that channel to see every time a job starts or stops on Dobby, along with who started or stopped that job. This serves as a lightweight log of our activity and has helped us stay more informed about what everyone else on our team is doing, which is very important when we are working against tight deadlines.

A few things have been a bit harder than we thought, especially around maintaining our Chef scripts for configuring the Dobby machine. It requires some effort to keep our credentials up to date, and it hasn’t always been easy to ensure that the machine can run some of our more esoteric R code. But overall, Dobby has been a huge time saver for us.