Continuously Deploying Django with Docker
By anders pearson 06 Dec 2015
I run about a dozen personal Django applications (including this site) on some small servers that I admin. I also run a half dozen or so applications written in other languages and other frameworks.
Since it’s a heterogeneous setup and I have a limited amount of free time for side projects, container technology like Docker that lets me standardize my production deployment is quite appealing.
I run a continuous deployment pipeline for all of these applications so every git commit I make to master goes through a test pipeline and ends up deployed to the production servers (assuming all the tests pass).
Getting Django to work smoothly in a setup like this is non-trivial. This post attempts to explain how I have it all working.
Background
First, some background on my setup. I run Ubuntu 14.04 servers on Digital Ocean. Ubuntu 14.04 still uses upstart as the default init, so that’s what I use to manage my application processes. I back the apps with Postgres and run an Nginx proxy in front of them. I serve static assets via S3 and CloudFront. I also use Salt for config management and provisioning, so if some of the config files here look a bit tedious or tricky to maintain and keep in sync, keep in mind that I’m probably actually using Salt to template and automate them. I also have a fairly extensive monitoring setup that I won’t go into here, but it will generally let me know as soon as anything goes wrong.
I currently have three “application” servers where the Django applications themselves run. Typically I run each application on two servers, which Nginx load balances between. A few of the applications also use Celery for background jobs and Celery Beat for periodic tasks. For those, the celery and celery beat processes run on the third application server.
My goal for my setup was to be able to deploy new versions of my
Django apps automatically and safely just by doing git push origin master
(which typically pushes to a GitHub repo). That means that the
code needs to be tested, a new Docker image needs to be built,
distributed to the application servers, database migrations run,
static assets compiled and pushed to S3, and the new version of the
application started in place of the old. Preferably without any
downtime for the users.
I’ll walk through the setup for my web-based feedreader app, antisocial, since it is one of the ones with Celery processes. Other apps are all basically the same except they might not need the Celery parts.
I should also point out that I am perpetually tweaking stuff. This is what I’m doing at the moment, but it will probably be outdated soon after I publish this as I find other things to improve.
Dockerfile
Let’s start with the Dockerfile:
Dockerfile:
FROM ccnmtl/django.base
ADD wheelhouse /wheelhouse
RUN apt-get update && apt-get install -y libxml2-dev libxslt-dev
RUN /ve/bin/pip install --no-index -f /wheelhouse -r /wheelhouse/requirements.txt
WORKDIR /app
COPY . /app/
RUN /ve/bin/python manage.py test
EXPOSE 8000
ADD docker-run.sh /run.sh
ENV APP antisocial
ENTRYPOINT ["/run.sh"]
CMD ["run"]
Like most people, I started using Docker by doing FROM ubuntu:trusty or something similar at the beginning of all my Dockerfiles. That’s not really ideal though and results in large docker images that are slow to work with, so lately I’ve been trying to get my docker images as slim and minimal as possible.
Roughly following Glyph’s approach, I split the docker image build process into a base image and a “builder” image so the final image can be constructed without the whole compiler toolchain included. I have published the base and builder images as ccnmtl/django.base and ccnmtl/django.build respectively, and you can see exactly how they are made here.

Essentially, they are both built on top of Debian Jessie (quite a bit smaller than Ubuntu images and similar enough). The base image contains the bare minimum, while the build image contains a whole toolchain for building wheels out of python libraries. I have a Makefile with some bits like this:
ROOT_DIR:=$(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))
APP=antisocial
REPO=thraxil
WHEELHOUSE=wheelhouse

$(WHEELHOUSE)/requirements.txt: $(REQUIREMENTS)
	mkdir -p $(WHEELHOUSE)
	docker run --rm \
		-v $(ROOT_DIR):/app \
		-v $(ROOT_DIR)/$(WHEELHOUSE):/wheelhouse \
		ccnmtl/django.build
	cp $(REQUIREMENTS) $(WHEELHOUSE)/requirements.txt
	touch $(WHEELHOUSE)/requirements.txt

build: $(WHEELHOUSE)/requirements.txt
	docker build -t $(IMAGE) .
So when I do make build, if the requirements.txt has changed since the last time, it uses the build image to generate a directory with wheels for every library specified in requirements.txt, then runs docker build, which can do a very simple (and fast) pip install of those wheels.
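Roughly speaking, all the build image does when run with /app and /wheelhouse mounted is something like this (an approximation; the actual command is baked into ccnmtl/django.build):

/ve/bin/pip wheel --wheel-dir=/wheelhouse -r /app/requirements.txt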
Once the requirements are installed, it runs the application’s unit tests. I expose port 8000 and copy in a custom script to use as an entry point.
docker-run.sh
That script makes the container a bit easier to work with. It looks like this:
#!/bin/bash
cd /app/
if [[ "$SETTINGS" ]]; then
    export DJANGO_SETTINGS_MODULE="$APP.$SETTINGS"
else
    export DJANGO_SETTINGS_MODULE="$APP.settings_docker"
fi

if [ "$1" == "migrate" ]; then
    exec /ve/bin/python manage.py migrate --noinput
fi

if [ "$1" == "collectstatic" ]; then
    exec /ve/bin/python manage.py collectstatic --noinput
fi

if [ "$1" == "compress" ]; then
    exec /ve/bin/python manage.py compress
fi

if [ "$1" == "shell" ]; then
    exec /ve/bin/python manage.py shell
fi

if [ "$1" == "worker" ]; then
    exec /ve/bin/python manage.py celery worker
fi

if [ "$1" == "beat" ]; then
    exec /ve/bin/python manage.py celery beat
fi

# run arbitrary commands
if [ "$1" == "manage" ]; then
    shift
    exec /ve/bin/python manage.py "$@"
fi

if [ "$1" == "run" ]; then
    exec /ve/bin/gunicorn --env \
        DJANGO_SETTINGS_MODULE=$DJANGO_SETTINGS_MODULE \
        $APP.wsgi:application -b 0.0.0.0:8000 -w 3 \
        --access-logfile=- --error-logfile=-
fi
With the ENTRYPOINT and CMD set up that way in the Dockerfile, I can just run:
$ docker run thraxil/antisocial
and it will run the gunicorn process, serving the app on port 8000. Or, I can do:
$ docker run thraxil/antisocial migrate
and it will run the database migration task. Similar for collectstatic, compress, worker, beat, etc. Or, I can do:
$ docker run thraxil/antisocial manage some_other_command --with-flags
to run any other Django manage.py command (this is really handy for dealing with migrations that need to be faked out, etc.).
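For example, faking out a migration would look something like this (the app and migration names here are made up for illustration):

$ docker run thraxil/antisocial manage migrate --fake someapp 0002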
docker-runner
Of course, all of those exact commands would run into problems: they each need various environment variables passed in to actually work.
The settings_docker settings module that the script defaults to for the container is a fairly standard Django settings file, except that it pulls all the required settings out of environment variables. The bulk of it comes from a common library that you can see here.
This gives us a nice twelve-factor style setup and keeps the docker containers very generic and reusable. If someone else wants to run one of these applications, they can pretty easily run the same container and just give it their own environment variables.
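Eg, someone else could stand up their own copy of this app with nothing more than their own settings (the values below are obviously made up):

$ docker run \
    -e SECRET_KEY=their_own_secret \
    -e ALLOWED_HOSTS=.example.com \
    -e BROKER_URL=amqp://user:pass@their-broker:5672//antisocial \
    ... more settings ... \
    thraxil/antisocial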
The downside, though, is that it gets a bit painful to actually run things from the commandline, particularly for one-off tasks like database migrations, because you actually need to specify a dozen or so -e flags on every command.
I cooked up a little bit of shell script with a dash of convention over configuration to ease that pain.
All the servers get a simple docker-runner script that looks like:
#!/bin/bash
APP=$1
shift
IMAGE=
OPTS=
if [ -f /etc/default/$APP ]; then
    . /etc/default/$APP
fi
TAG="latest"
if [ -f /var/www/$APP/TAG ]; then
    . /var/www/$APP/TAG
fi
exec /usr/bin/docker run $OPTS $EXTRA $IMAGE:$TAG "$@"
That expects every app to have a file in /etc/default that defines $IMAGE and $OPTS variables. Eg, antisocial’s looks something like:

/etc/default/antisocial:
export IMAGE="thraxil/antisocial"
export OPTS="--link postfix:postfix \
    --rm \
    -e SECRET_KEY=some_secret_key \
    -e AWS_S3_CUSTOM_DOMAIN=d115djs1mf98us.cloudfront.net \
    -e AWS_STORAGE_BUCKET_NAME=s3-bucket-name \
    -e AWS_ACCESS_KEY=... \
    -e AWS_SECRET_KEY=... \
    ... more settings ... \
    -e ALLOWED_HOSTS=.thraxil.org \
    -e BROKER_URL=amqp://user:pass@host:5672//antisocial"
With that in place, I can just do:
$ docker-runner antisocial migrate
And it fills everything in. So I can keep the common options in one place and not have to type them in every time.
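Under the hood, that expands to roughly this (given the options above, and the default latest tag when no TAG file is present):

$ /usr/bin/docker run --link postfix:postfix --rm \
    -e SECRET_KEY=some_secret_key \
    ... more settings ... \
    thraxil/antisocial:latest migrate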
(I’ll get to the TAG file that it mentions in a bit.)
upstart
With those in place, the upstart config for the application can be fairly simple:
/etc/init/antisocial.conf:
description "start/stop antisocial docker"
start on filesystem and started docker-postfix and started registrator
stop on runlevel [!2345]
respawn
script
    export EXTRA="-e SERVICE_NAME=antisocial -p 192.81.1.1::8000"
    exec /usr/local/bin/docker-runner antisocial
end script
The Celery and Celery Beat services have very similar ones, except they run the worker and beat commands instead and they don’t need a SERVICE_NAME set or ports configured.
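A minimal sketch of what the worker’s config looks like, assuming the same pattern (the job name matches the $APP-worker convention used in the deploy script later on):

/etc/init/antisocial-worker.conf:

description "start/stop antisocial celery worker"
start on filesystem and started docker-postfix
stop on runlevel [!2345]
respawn
script
    exec /usr/local/bin/docker-runner antisocial worker
end script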
Consul
Next, I use consul, consul-template, and registrator to rig everything up so Nginx automatically proxies to the appropriate ports on the appropriate application servers.
Each app is registered as a service (hence the SERVICE_NAME parameter in the upstart config). Registrator sees the containers starting and stopping and registers and deregisters them with consul as appropriate, inspecting them to get the IP and port info.
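Registrator itself just runs as a container on each application server, watching the Docker socket. Its invocation looks roughly like this (the consul address will depend on your setup):

$ docker run -d \
    --name registrator \
    --net host \
    -v /var/run/docker.sock:/tmp/docker.sock \
    gliderlabs/registrator \
    consul://localhost:8500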
consul-template runs on the Nginx server and has a template defined for each app that looks something like:
{{if service "antisocial"}}
upstream antisocial {
    {{range service "antisocial"}} server {{.Address}}:{{.Port}};
    {{end}}
}
{{end}}

server {
    listen 80;
    server_name feeds.thraxil.org;
    client_max_body_size 40M;

    {{if service "antisocial"}}
    location / {
        proxy_pass http://antisocial;
        proxy_next_upstream error timeout invalid_header http_500;
        proxy_connect_timeout 2;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
    error_page 502 /502.html;
    location = /502.html {
        root /var/www/down/;
    }
    {{else}}
    root /var/www/down/;
    try_files $uri /502.html;
    {{end}}
}
That just dynamically creates an endpoint for each running app instance pointing to the right IP and port. Nginx then round-robins between them. If none are running, it changes it out to serve a “sorry, the site is down” kind of page instead. Consul-template updates the nginx config and reloads nginx as soon as it sees any change to the service. It’s really nice. If I need more instances of a particular app running, I can just spin one up on another server and it instantly gets added to the pool. If one crashes or is shut down, it’s removed just as quickly. As long as there’s at least one instance running at any given time, visitors to the site should never be any the wiser (as long as it can handle the current traffic).
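The consul-template invocation that ties it all together looks something like this (the paths here are illustrative, not my exact ones):

$ consul-template \
    -consul localhost:8500 \
    -template "/etc/consul-templates/antisocial.ctmpl:/etc/nginx/sites-enabled/antisocial.conf:service nginx reload"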
That really covers the server and application setup.
What’s left is the deployment part. Ie, how it gets from a new commit on master to running on the application servers.
Jenkins
Jenkins is kind of a no-brainer for CI/CD stuff. I could probably rig something similar up with TravisCI or Wercker or another hosted CI, but I’m more comfortable keeping my credentials on my own servers for now.
So I have a Jenkins server running and I have a job set up there for each application. It gets triggered by a webhook from github whenever there’s a commit to master.
Jenkins checks out the code and runs:
export TAG=build-$BUILD_NUMBER
make build
docker push thraxil/antisocial:$TAG
$BUILD_NUMBER is a built-in environment variable that Jenkins sets on each build. So it’s just building a new docker image (which runs the test suite as part of the build process) and pushing it to the Docker Hub with a unique tag corresponding to this build.
When that completes successfully, it triggers a downstream Jenkins job called django-deploy, which is a parameterized build. It passes it the following parameters:
APP=antisocial
TAG=build-$BUILD_NUMBER
HOSTS=appserver1 appserver2
CELERY_HOSTS=appserver3
BEAT_HOSTS=appserver3
These are fairly simple apps that I run mostly for my own amusement, so I don’t have extensive integration tests. If I did, instead of triggering django-deploy directly here, I would trigger other jobs to run those tests against the newly created and tagged image first.
The django-deploy job runs the following script:
#!/bin/bash
hosts=(${HOSTS})
chosts=(${CELERY_HOSTS})
bhosts=(${BEAT_HOSTS})

for h in "${hosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${chosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

for h in "${bhosts[@]}"
do
    ssh $h docker pull ${REPOSITORY}thraxil/$APP:$TAG
    ssh $h cp /var/www/$APP/TAG /var/www/$APP/REVERT || true
    ssh $h "echo export TAG=$TAG > /var/www/$APP/TAG"
done

h=${hosts[0]}
ssh $h /usr/local/bin/docker-runner $APP migrate
ssh $h /usr/local/bin/docker-runner $APP collectstatic
ssh $h /usr/local/bin/docker-runner $APP compress

for h in "${hosts[@]}"
do
    ssh $h sudo stop $APP || true
    ssh $h sudo start $APP
done

for h in "${chosts[@]}"
do
    ssh $h sudo stop $APP-worker || true
    ssh $h sudo start $APP-worker
done

for h in "${bhosts[@]}"
do
    ssh $h sudo stop $APP-beat || true
    ssh $h sudo start $APP-beat
done
It’s a bit long, but straightforward.
First, it just pulls the new docker image down onto each server. This is done first because the docker pull is usually the slowest part of the process. Might as well get it out of the way first. On each host, it also writes to the /var/www/$APP/TAG file that we saw mentioned back in docker-runner. The contents are just a variable assignment specifying the tag that we just built and pulled and are about to cut over to. The docker-runner script knows to use the specific tag that’s set in that file. Of course, it first backs up the old one to a REVERT file that can then be used to easily roll back the whole deploy if something goes wrong.
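So after, say, build 42 deploys on top of build 41, each host ends up with something like this (build numbers made up for illustration):

$ cat /var/www/antisocial/TAG
export TAG=build-42
$ cat /var/www/antisocial/REVERT
export TAG=build-41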
Next, the database migrations and static asset tasks have to run. They only need to run on a single host though, so it just pulls the first one off the list and runs migrate, collectstatic, and compress on that one.
Finally, it goes host by host and stops and starts the service on each in turn. Remember that with the whole consul setup, as long as they aren’t all shut off at the same time, overall availability should be preserved.
Then, of course, it does the same thing for the celery and celery beat services.
If that all completes successfully, it’s done. If it fails somewhere along the way, another Jenkins job is triggered that basically restores the TAG file from REVERT and restarts the services, putting everything back to the previous version.
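A minimal sketch of what that revert job amounts to (the real one also covers the worker and beat hosts):

#!/bin/bash
hosts=(${HOSTS})
for h in "${hosts[@]}"
do
    # put the previous tag back in place...
    ssh $h cp /var/www/$APP/REVERT /var/www/$APP/TAG
    # ...and restart so docker-runner picks it up
    ssh $h sudo stop $APP || true
    ssh $h sudo start $APP
done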
Conclusion and Future Directions
That’s a lot to digest. Sorry. In practice, it really doesn’t feel that complicated. Mostly stuff just works and I don’t have to think about it. I write code, commit, and git push. A few minutes later I get an email from Jenkins telling me that it’s been deployed. Occasionally, Jenkins tells me that I broke something, and I go investigate and fix it (while the site stays up). If I need more capacity, I provision a new server and it joins the consul cluster. Then I can add it to the list to deploy to, kick off a Jenkins job, and it’s running. I’ve spent almost as much time writing this blog post explaining everything in detail as it took to actually build the system.
Provisioning servers is fast and easy because they barely need anything installed on them besides docker and a couple config files and scripts. If a machine crashes, the rest are unaffected and service is uninterrupted. Overall, I’m pretty happy with this setup. It’s better than the statically managed approach I had before (no more manually editing nginx configs and hoping I don’t introduce a syntax error that takes all the sites down until I fix it).
Nevertheless, what I’ve put together is basically a low-rent, probably buggy version of a PaaS. I knew this going in. I did it anyway because I wanted to get a handle on all of this myself. (I’m also weird and enjoy this kind of thing.) Now that I feel like I really understand the challenges here, when I get time, I’ll probably convert it all to run on Kubernetes or Mesos or something similar.