Continuously Deploying Django with Docker
By anders pearson 06 Dec 2015
I run about a dozen personal Django applications (including this site) on some small servers that I admin. I also run a half dozen or so applications written in other languages and other frameworks.
Since it’s a heterogeneous setup and I have a limited amount of free time for side projects, container technology like Docker that lets me standardize my production deployment is quite appealing.
I run a continuous deployment pipeline for all of these applications so every git commit I make to master goes through a test pipeline and ends up deployed to the production servers (assuming all the tests pass).
Getting Django to work smoothly in a setup like this is non-trivial. This post attempts to explain how I have it all working.
Background
First, some background on my setup. I run Ubuntu 14.04 servers on Digital Ocean. Ubuntu 14.04 still uses upstart as the default init, so that’s what I use to manage my application processes. I back the apps with Postgres and run an Nginx proxy in front of them. I serve static assets via S3 and CloudFront. I also use Salt for config management and provisioning, so if some of the config files here look a bit tedious or tricky to maintain and keep in sync, keep in mind that I’m probably actually using Salt to template and automate them. I also have a fairly extensive monitoring setup that I won’t go into here, but it will generally let me know as soon as anything goes wrong.
I currently have three “application” servers where the Django applications themselves run. Typically I run each application on two servers, which Nginx load balances between. A few of the applications also use Celery for background jobs and Celery Beat for periodic tasks. For those, the celery and celery beat processes run on the third application server.
My goal for my setup was to be able to deploy new versions of my
Django apps automatically and safely just by doing git push origin master
(which typically pushes to a GitHub repo). That means that the
code needs to be tested, a new Docker image needs to be built,
distributed to the application servers, database migrations run,
static assets compiled and pushed to S3, and the new version of the
application started in place of the old. Preferably without any
downtime for the users.
I’ll walk through the setup for my web-based feedreader app, antisocial, since it is one of the ones with Celery processes. Other apps are all basically the same except they might not need the Celery parts.
I should also point out that I am perpetually tweaking stuff. This is what I’m doing at the moment, but it will probably be outdated soon after I publish this as I find other things to improve.
Dockerfile
Let’s start with the Dockerfile:
Dockerfile:
FROM ccnmtl/django.base
ADD wheelhouse /wheelhouse
RUN apt-get update && apt-get install -y libxml2-dev libxslt-dev
RUN /ve/bin/pip install --no-index -f /wheelhouse -r /wheelhouse/requirements.txt
WORKDIR /app
COPY . /app/
RUN /ve/bin/python manage.py test
EXPOSE 8000
ADD docker-run.sh /run.sh
ENV APP antisocial
ENTRYPOINT ["/run.sh"]
CMD ["run"]
Like most people, I started using Docker by doing FROM ubuntu:trusty or something similar at the beginning of all my Dockerfiles. That’s not really ideal though and results in large docker images that are slow to work with, so lately I’ve been trying to get my docker images as slim and minimal as possible.
Roughly following Glyph’s approach, I split the docker image build process into a base image and a “builder” image so the final image can be constructed without the whole compiler toolchain included. I have published the base and builder images as ccnmtl/django.base and ccnmtl/django.build respectively, and you can see exactly how they are made here.

Essentially, they are both built on top of Debian Jessie (quite a bit smaller than Ubuntu images and similar enough). The base image contains the bare minimum, while the build image contains a whole toolchain for building wheels out of python libraries. I have a Makefile with some bits like this:
ROOT_DIR:=$(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))
APP=antisocial
REPO=thraxil
WHEELHOUSE=wheelhouse

$(WHEELHOUSE)/requirements.txt: $(REQUIREMENTS)
	mkdir -p $(WHEELHOUSE)
	docker run --rm \
		-v $(ROOT_DIR):/app \
		-v $(ROOT_DIR)/$(WHEELHOUSE):/wheelhouse \
		ccnmtl/django.build
	cp $(REQUIREMENTS) $(WHEELHOUSE)/requirements.txt
	touch $(WHEELHOUSE)/requirements.txt

build: $(WHEELHOUSE)/requirements.txt
	docker build -t $(IMAGE) .
So when I do make build, if the requirements.txt has changed since the last time, it uses the build image to generate a directory with wheels for every library specified in requirements.txt, then runs docker build, which can do a very simple (and fast) pip install of those wheels.
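Roughly speaking, all the build image does when run with /app and /wheelhouse mounted is something like this (an approximation; the actual command is baked into ccnmtl/django.build):

/ve/bin/pip wheel --wheel-dir=/wheelhouse -r /app/requirements.txt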
Once the requirements are installed, it runs the application’s unit tests. I expose port 8000 and copy in a custom script to use as an entry point.
docker-run.sh
That script makes the container a bit easier to work with. It looks like this:
#!/bin/bash
cd /app/
if [[ "$SETTINGS" ]]; then
    export DJANGO_SETTINGS_MODULE="$APP.$SETTINGS"
else
    export DJANGO_SETTINGS_MODULE="$APP.settings_docker"
fi

if [ "$1" == "migrate" ]; then
    exec /ve/bin/python manage.py migrate --noinput
fi

if [ "$1" == "collectstatic" ]; then
    exec /ve/bin/python manage.py collectstatic --noinput
fi

if [ "$1" == "compress" ]; then
    exec /ve/bin/python manage.py compress
fi

if [ "$1" == "shell" ]; then
    exec /ve/bin/python manage.py shell
fi

if [ "$1" == "worker" ]; then
    exec /ve/bin/python manage.py celery worker
fi

if [ "$1" == "beat" ]; then
    exec /ve/bin/python manage.py celery beat
fi

# run arbitrary commands
if [ "$1" == "manage" ]; then
    shift
    exec /ve/bin/python manage.py "$@"
fi

if [ "$1" == "run" ]; then
    exec /ve/bin/gunicorn --env \
        DJANGO_SETTINGS_MODULE=$DJANGO_SETTINGS_MODULE \
        $APP.wsgi:application -b 0.0.0.0:8000 -w 3 \
        --access-logfile=- --error-logfile=-
fi
With the ENTRYPOINT and CMD set up that way in the Dockerfile, I can just run:
$ docker run thraxil/antisocial
and it will run the gunicorn process, serving the app on port 8000. Or, I can do:
$ docker run thraxil/antisocial migrate
and it will run the database migration task. Similar for collectstatic, compress, worker, beat, etc. Or, I can do:
$ docker run thraxil/antisocial manage some_other_command --with-flags
to run any other Django manage.py command (this is really handy for dealing with migrations that need to be faked out, etc.).
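For example, faking out a migration would look something like this (the app and migration names here are made up for illustration):

$ docker run thraxil/antisocial manage migrate --fake someapp 0002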
docker-runner
Of course, all of those exact commands would run into problems: they each need various environment variables passed in to actually work.
The settings_docker settings module that the script defaults to for the container is a fairly standard Django settings file, except that it pulls all the required settings out of environment variables. The bulk of it comes from a common library that you can see here.
This gives us a nice twelve-factor style setup and keeps the docker containers very generic and reusable. If someone else wants to run one of these applications, they can pretty easily run the same container and just give it their own environment variables.
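Eg, someone else could stand up their own copy of this app with nothing more than their own settings (the values below are obviously made up):

$ docker run \
    -e SECRET_KEY=their_own_secret \
    -e ALLOWED_HOSTS=.example.com \
    -e BROKER_URL=amqp://user:pass@their-broker:5672//antisocial \
    ... more settings ... \
    thraxil/antisocial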
The downside, though, is that it gets a bit painful to actually run things from the commandline, particularly for one-off tasks like database migrations, because you actually need to specify a dozen or so -e flags on every command.
I cooked up a little bit of shell script with a dash of convention over configuration to ease that pain.
All the servers get a simple docker-runner script that looks like:
#!/bin/bash
APP=$1
shift
IMAGE=
OPTS=
if [ -f /etc/default/$APP ]; then
    . /etc/default/$APP
fi
TAG="latest"
if [ -f /var/www/$APP/TAG ]; then
    . /var/www/$APP/TAG
fi
exec /usr/bin/docker run $OPTS $EXTRA $IMAGE:$TAG "$@"
That expects every app to have a file in /etc/default that defines $IMAGE and $OPTS variables. Eg, antisocial’s looks something like:

/etc/default/antisocial:
export IMAGE="thraxil/antisocial"
export OPTS="--link postfix:postfix \
    --rm \
    -e SECRET_KEY=some_secret_key \
    -e AWS_S3_CUSTOM_DOMAIN=d115djs1mf98us.cloudfront.net \
    -e AWS_STORAGE_BUCKET_NAME=s3-bucket-name \
    -e AWS_ACCESS_KEY=... \
    -e AWS_SECRET_KEY=... \
    ... more settings ... \
    -e ALLOWED_HOSTS=.thraxil.org \
    -e BROKER_URL=amqp://user:pass@host:5672//antisocial"
With that in place, I can just do:
$ docker-runner antisocial migrate
And it fills everything in. So I can keep the common options in one place and not have to type them in every time.
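Under the hood, that expands to roughly this (given the options above, and the default latest tag when no TAG file is present):

$ /usr/bin/docker run --link postfix:postfix --rm \
    -e SECRET_KEY=some_secret_key \
    ... more settings ... \
    thraxil/antisocial:latest migrate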
(I’ll get to the TAG file that it mentions in a bit.)
upstart
With those in place, the upstart config for the application can be fairly simple:
/etc/init/antisocial.conf:
description "start/stop antisocial docker"
start on filesystem and started docker-postfix and started registrator
stop on runlevel [!2345]
respawn
script
    export EXTRA="-e SERVICE_NAME=antisocial -p 192.81.1.1::8000"
    exec /usr/local/bin/docker-runner antisocial
end script
The Celery and Celery Beat services have very similar ones, except they run the worker and beat commands instead and they don’t need a SERVICE_NAME set or ports configured.
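A minimal sketch of what the worker’s config looks like, assuming the same pattern (the job name matches the $APP-worker convention used in the deploy script later on):

/etc/init/antisocial-worker.conf:

description "start/stop antisocial celery worker"
start on filesystem and started docker-postfix
stop on runlevel [!2345]
respawn
script
    exec /usr/local/bin/docker-runner antisocial worker
end script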
Consul
Next, I use consul, consul-template, and registrator to rig everything up so Nginx automatically proxies to the appropriate ports on the appropriate application servers.
Each app is registered as a service (hence the SERVICE_NAME parameter in the upstart config). Registrator sees the containers starting and stopping and registers and deregisters them with consul as appropriate, inspecting them to get the IP and port info.
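Registrator itself just runs as a container on each application server, watching the Docker socket. Its invocation looks roughly like this (the consul address will depend on your setup):

$ docker run -d \
    --name registrator \
    --net host \
    -v /var/run/docker.sock:/tmp/docker.sock \
    gliderlabs/registrator \
    consul://localhost:8500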
consul-template runs on the Nginx server and has a template defined for each app that looks something like:
{{if service "antisocial"}}
upstream antisocial {
    {{range service "antisocial"}} server {{.Address}}:{{.Port}};
    {{end}}
}
{{end}}

server {
    listen 80;
    server_name feeds.thraxil.org;
    client_max_body_size 40M;

    {{if service "antisocial"}}
    location / {
        proxy_pass http://antisocial;
        proxy_next_upstream error timeout invalid_header http_500;
        proxy_connect_timeout 2;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
    error_page 502 /502.html;
    location = /502.html {
        root /var/www/down/;
    }
    {{else}}
    root /var/www/down/;
    try_files $uri /502.html;
    {{end}}
}
That just dynamically creates an endpoint for each running app instance pointing to the right IP and port. Nginx then round-robins between them. If none are running, it changes it out to serve a “sorry, the site is down” kind of page instead. Consul-template updates the nginx config and reloads nginx as soon as it sees any change to the service. It’s really nice. If I need more instances of a particular app running, I can just spin one up on another server and it instantly gets added to the pool. If one crashes or is shut down, it’s removed just as quickly. As long as there’s at least one instance running at any given time, visitors to the site should never be any the wiser (as long as it can handle the current traffic).
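The consul-template invocation that ties it all together looks something like this (the paths here are illustrative, not my exact ones):

$ consul-template \
    -consul localhost:8500 \
    -template "/etc/consul-templates/antisocial.ctmpl:/etc/nginx/sites-enabled/antisocial.conf:service nginx reload"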
That really covers the server and application setup.
What’s left is the deployment part. Ie, how it gets from a new commit on master to running on the application servers.
Jenkins
Jenkins is kind of a no-brainer for CI/CD stuff. I could probably rig something similar up with TravisCI or Wercker or another hosted CI, but I’m more comfortable keeping my credentials on my own servers for now.
So I have a Jenkins server running and I have a job set up there for each application. It gets triggered by a webhook from github whenever there’s a commit to master.
Jenkins checks out the code and runs:
export TAG=build-$BUILD_NUMBER
make build
docker push thraxil/antisocial:$TAG
$BUILD_NUMBER is a built-in environment variable that Jenkins sets on each build. So it’s just building a new docker image (which runs the test suite as part of the build process) and pushing it to the Docker Hub with a unique tag corresponding to this build.
When that completes successfully, it triggers a downstream Jenkins job called django-deploy, which is a parameterized build. It passes it the following parameters:
APP=antisocial
TAG=build-$BUILD_NUMBER
HOSTS=appserver1 appserver2
CELERY_HOSTS=appserver3
BEAT_HOSTS=appserver3
These are fairly simple apps that I run mostly for my own amusement, so I don’t have extensive integration tests. If I did, instead of triggering django-deploy directly here, I would trigger other jobs to run those tests against the newly created and tagged image first.
The django-deploy job runs the following script:
#!/bin/bash
hosts=(${HOSTS})
chosts=(${CELERY_HOSTS})
bhosts=(${BEAT_HOSTS})

for h in "${hosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${chosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${bhosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

h=${hosts[0]}
ssh $h /usr/local/bin/docker-runner $APP migrate
ssh $h /usr/local/bin/docker-runner $APP collectstatic
ssh $h /usr/local/bin/docker-runner $APP compress

for h in "${hosts[@]}"
do
    ssh $h sudo stop $APP || true
    ssh $h sudo start $APP
done

for h in "${chosts[@]}"
do
    ssh $h sudo stop $APP-worker || true
    ssh $h sudo start $APP-worker
done

for h in "${bhosts[@]}"
do
    ssh $h sudo stop $APP-beat || true
    ssh $h sudo start $APP-beat
done
It’s a bit long, but straightforward.
First, it just pulls the new docker image down onto each server. This is done first because the docker pull is usually the slowest part of the process. Might as well get it out of the way first. On each host, it also writes to the /var/www/$APP/TAG file that we saw mentioned back in docker-runner. The contents are just a variable assignment specifying the tag that we just built and pulled and are about to cut over to. The docker-runner script knows to use the specific tag that’s set in that file. Of course, it first backs up the old one to a REVERT file that can then be used to easily roll back the whole deploy if something goes wrong.
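So after, say, build 42 deploys on top of build 41, each host ends up with something like this (build numbers made up for illustration):

$ cat /var/www/antisocial/TAG
export TAG=build-42
$ cat /var/www/antisocial/REVERT
export TAG=build-41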
Next, the database migrations and static asset tasks have to run. They only need to run on a single host though, so it just pulls the first one off the list and runs migrate, collectstatic, and compress on that one.
Finally, it goes host by host and stops and starts the service on each in turn. Remember that with the whole consul setup, as long as they aren’t all shut off at the same time, overall availability should be preserved.
Then, of course, it does the same thing for the celery and celery beat services.
If that all completes successfully, it’s done. If it fails somewhere along the way, another Jenkins job is triggered that basically restores the TAG file from REVERT and restarts the services, putting everything back to the previous version.
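A minimal sketch of what that revert job amounts to (the real one also covers the worker and beat hosts):

#!/bin/bash
hosts=(${HOSTS})
for h in "${hosts[@]}"
do
    # put the previous tag back in place...
    ssh $h cp /var/www/$APP/REVERT /var/www/$APP/TAG
    # ...and restart so docker-runner picks it up
    ssh $h sudo stop $APP || true
    ssh $h sudo start $APP
done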
Conclusion and Future Directions
That’s a lot to digest. Sorry. In practice, it really doesn’t feel that complicated. Mostly stuff just works and I don’t have to think about it. I write code, commit, and git push. A few minutes later I get an email from Jenkins telling me that it’s been deployed. Occasionally, Jenkins tells me that I broke something, and I go investigate and fix it (while the site stays up). If I need more capacity, I provision a new server and it joins the consul cluster. Then I can add it to the list to deploy to, kick off a Jenkins job, and it’s running. I’ve spent almost as much time writing this blog post explaining everything in detail as it took to actually build the system.
Provisioning servers is fast and easy because they barely need anything installed on them besides docker and a couple config files and scripts. If a machine crashes, the rest are unaffected and service is uninterrupted. Overall, I’m pretty happy with this setup. It’s better than the statically managed approach I had before (no more manually editing nginx configs and hoping I don’t introduce a syntax error that takes all the sites down until I fix it).
Nevertheless, what I’ve put together is basically a low-rent, probably buggy version of a PaaS. I knew this going in. I did it anyway because I wanted to get a handle on all of this myself. (I’m also weird and enjoy this kind of thing.) Now that I feel like I really understand the challenges here, when I get time, I’ll probably convert it all to run on Kubernetes or Mesos or something similar.