Mathics: Django in Sage-Python on nginx and gunicorn

I recently moved Mathics from my “general” web server that is also running tripedia.org, fairteiler.com, this site, and lots of other stuff, to its own virtual machine (a KVM). Mathics uses a lot of special stuff (e.g., not the regular Python, but Sage) and occasionally tends to freeze the whole machine, so it makes sense to separate it from the rest.

In light of things to come (thinking of “live” collaboration and long polling), I thought Mathics should “go green”, with the potential to serve requests asynchronously. The Python WSGI HTTP server gunicorn seems like an obvious choice for serving Django applications this way, and it plays nicely with the HTTP proxy server nginx. Everything runs on Debian Linux.

In addition, I installed Sage 5.0 (this is as simple as extracting the files to an arbitrary location, running make, and waiting for a couple of hours). Because Sage’s version of Python (sage -python) will be used to run Django, and Gmail is used via SSL to send emails (registration/password emails and Django’s error emails, in particular), Sage’s Python has to be rebuilt with SSL support:

apt-get install libssl-dev
sage -f python
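
Whether the rebuild actually picked up SSL can be verified quickly with sage -python (a minimal check):

import ssl                    # fails with ImportError if Python was built without OpenSSL
from smtplib import SMTP_SSL  # smtplib only defines SMTP_SSL when SSL support is present
print "SSL support is available"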

nginx is configured by adding a file to /etc/nginx/sites-enabled containing something like this:

  upstream app_server {
    server unix:/tmp/gunicorn.sock fail_timeout=0;
  }

  server {
    listen 80 default;
    client_max_body_size 4G;
    server_name _;
    keepalive_timeout 5;

    location / {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_buffering off;
      proxy_pass http://app_server;
      break;
    }

    location /media/ {
      root /home/jan/static;
    }
  }

This will serve everything under /media/ from /home/jan/static (a symbolic link pointing to mathics/web/media), while all other URLs are proxied to gunicorn via the Unix socket declared in the upstream block.
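
For completeness, the Django side could look like this (a sketch; the values are assumptions chosen to match the nginx configuration above, where root means that a request for /media/foo is looked up at /home/jan/static/media/foo):

# settings.py (sketch; values are assumptions matching the nginx config above)
MEDIA_URL = '/media/'
MEDIA_ROOT = '/home/jan/static/media/'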

One way to deploy gunicorn is Circus, a process watcher and runner. It can be configured like the following:

[watcher:mathics]
cmd = sudo -u jan /home/jan/sage/sage -python /home/jan/sage/local/bin/gunicorn_django --bind unix:/tmp/gunicorn.sock --workers 4 /home/jan/mathics
working_dir = /home/jan/mathics
send_hup = true

At startup, this executes gunicorn_django in the Mathics directory, listening on the corresponding Unix socket; gunicorn will in turn fire up four workers. (Assuming the configuration above is saved as circus.ini, Circus itself is started with circusd circus.ini.)

I couldn’t figure out how to conveniently restart these Django worker processes. However, after simply killing them using

kill `ps -ef | grep "python /home/jan/sage/local/bin/gunicorn_django" | grep -v grep | awk '{print $2}'`

they will be restarted automatically, thereby loading any changes made to the Mathics source code.
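
For the record, gunicorn’s master process gracefully restarts its workers when it receives a SIGHUP, so an alternative would be to start gunicorn with a --pid file and signal the master directly (a sketch; the pid file path is hypothetical, and the Circus command above does not set --pid):

import os
import signal

# send SIGHUP to the gunicorn master, which reloads and restarts its workers
with open('/tmp/gunicorn.pid') as f:  # hypothetical --pid location
    master_pid = int(f.read().strip())
os.kill(master_pid, signal.SIGHUP)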

Please let me know if you have any suggestions.

Memory-efficient Django queries

When doing heavy computations that access your database through the Django Object-Relational Mapper, you might notice Python consuming lots of memory. This will probably not happen while serving production web requests, because having to deal with lots of data in a single request already indicates that something is wrong and that at least some preprocessing should be done. That is probably why Django isn’t tailored towards such needs and rather favors speed (both at execution and at coding time) at the cost of memory. But when you’re in that preprocessing phase, or you’re just using the Django ORM for some scientific computation, you don’t want to consume more memory than absolutely necessary.

The first issue to watch out for is debug mode. From the docs:

It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you are debugging, but on a production server, it will rapidly consume memory.

“Remembering” means that Django stores all SQL queries and their execution times in the list django.db.connection.queries, which has the form [{'sql': 'SELECT ...', 'time': '0.01'}, ...]. This is useful when you want to profile your queries (see the docs), but it is needless overhead when you’re just doing computations with heavy database access. So in that case, either set DEBUG to False or clear this list regularly.
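
If DEBUG has to stay on, the list can be cleared periodically during long-running jobs (a sketch; entry_ids and process are hypothetical stand-ins for your data and per-row work):

from django import db

for i, entry_id in enumerate(entry_ids):
    entry = Entry.objects.get(pk=entry_id)  # with DEBUG on, each query is logged
    process(entry)
    if i % 1000 == 0:
        db.reset_queries()  # empties django.db.connection.queries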

Moreover, it is important to understand how Django holds data in memory. Although a queryset constructs its objects lazily, on the fly, the fetched rows are kept in memory by default so that multiple iterations over the result can use this cache. This can be avoided by using queryset.iterator(). So while

entries = Entry.objects.all()
for entry in entries:
    print entry
for entry in entries:
    print entry

will receive all entries from the database once and keep them in memory,

for entry in entries.iterator():
    print entry
for entry in entries.iterator():
    print entry

will execute the query twice but save memory.

However, even using iterator() can still lead to a heavy memory footprint, not directly on Django’s side, but from the database interface (e.g. the Python MySQLdb module). It will receive and store all the resulting data from the database server before even handing bits over to Django. So the only way to avoid this is to use queries that don’t produce too much data at once. This snippet does exactly that:

def queryset_iterator(queryset, chunksize=1000):
    # Fetch the queryset in chunks of chunksize, ordered by primary key.
    # Note: assumes a non-empty queryset with numerical primary keys.
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk  # extra query for the largest pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row

It basically fetches slices (default size 1000) of the queryset, ordered by the primary key. As a consequence, any ordering previously applied to the queryset is lost. The reason for this behavior is, of course, that ordering (and especially slicing) by primary key is fast. You can use the function like this:

for entry in queryset_iterator(Entry.objects):
    print entry

Apart from not being able to order the queryset, this approach does not handle concurrent modification of the data well: when rows are inserted or deleted while iterating over the dataset with queryset_iterator, rows might be reported several times or not at all. That should not be a big problem when preprocessing data, though.

I modified queryset_iterator slightly to save the initial primary-key query, to allow for non-numerical primary keys, and to support reverse ordering by primary key:

def queryset_iterator(queryset, chunksize=1000, reverse=False):
    # Fetch the queryset in chunks of chunksize, ordered by primary key.
    # Handles empty querysets and non-numerical primary keys, and
    # optionally iterates in reverse primary-key order.
    ordering = '-' if reverse else ''
    queryset = queryset.order_by(ordering + 'pk')
    last_pk = None
    new_items = True
    while new_items:
        new_items = False
        chunk = queryset
        if last_pk is not None:
            # continue right after the last primary key of the previous chunk
            func = 'lt' if reverse else 'gt'
            chunk = chunk.filter(**{'pk__' + func: last_pk})
        chunk = chunk[:chunksize]
        row = None
        for row in chunk:
            yield row
        if row is not None:
            last_pk = row.pk
            new_items = True
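
It is used in the same way; with reverse=True, for instance, it iterates from the largest primary key down:

for entry in queryset_iterator(Entry.objects, reverse=True):
    print entry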

To sum things up:

  • Set DEBUG = False when doing lots of database operations.
  • Use queryset_iterator when you deal with millions of rows and don’t care about ordering.
  • Still enjoy the convenience of Django!

TwitterExplorer

Finished my work on TwitterExplorer (as far as work can ever be finished). It now features a huge, interactive clustered network of Twitter #hashtags. There are detailed statistics, timelines, and a classification (attempt) for each hashtag, plus a separate clustered view of the 3-hop network around it.

This is one more Python/Django application (for details, see the about page). Especially the interactive graph involved a lot of JavaScript tweaking with the JavaScript InfoVis Toolkit, although I’m not sure it was really worth using it; in the end, I might have needed less code by starting from scratch with just jQuery and the HTML5 canvas element.

This is also the first time I placed a Flattr button somewhere. Let’s see how much it generates; I don’t have any expectations.