New challenges

Life has been pretty busy recently. I worked on my Google Summer of Code project implementing lattices in Sage; I continued working on several optimization projects, mostly dealing with the charging of multiple electric vehicles (in cooperation with FH Joanneum Kapfenberg and other companies); I took my last exams and finished my master’s thesis (“Optimization of a Purlin Punching Process”). I’m glad (and a little proud) that all of this worked out so well.

Adding to that, I organized my move to the U.S., where I finally landed and already began working for Wolfram Research. I am working as a Kernel Developer on Mathematica Online (or whatever it will be called once it gets released to the public). This is quite related to my previous work on Mathics.

Now, the sad news is that I have to stop actively coding on Mathics as a consequence. However, I see it in really good hands with Angus having already taken the lead recently, and he will certainly continue to keep Mathics flourish. Others are jumping in as well, so Mathics will not disappear, but grow further. Of course, I will still be there when assistance is needed, to merge in pull requests, and to keep the server running.

I’m really excited about the new challenges. I’ll share some of my experience on a special (rather personal) blog Go West, Young Jan. It’s in German so I don’t forget my mother’s tongue completely.

Let the adventure begin.

Mathics: Django in Sage-Python on nginx and gunicorn

I recently moved Mathics from my “general” web server that is also running tripedia.org, fairteiler.com, this site, and lots of other stuff, to its own virtual machine (a KVM). Mathics uses a lot of special stuff (e.g., not the regular Python, but Sage) and occasionally tends to freeze the whole machine, so it makes sense to separate it from the rest.

In the light of things to come (thinking of “live” collaboration and long polling) I thought Mathics should “go green” with the potential to serve it asynchronously. The Python WSGI HTTP server gunicorn seems like an obvious choice to serve Django applications this way, and it plays together nicely with the HTTP proxy server nginx. Everything runs on Debian Linux.

In addition, I installed Sage 5.0 (this is as simple as extracting the files to an arbitrary location, running make and waiting for a couple of hours). Because Sage’s version of Python (sage -python) will be used to run Django and gmail is used via SSL to send emails (registration/password and Django’s error emails, in particular), Sage has to be rebuilt with SSL support:

apt-get install libssl-dev
sage -f python

nginx is configured by adding a file to /etc/nginx/site-enabled containing something like this:

  upstream app_server {
    server unix:/tmp/gunicorn.sock fail_timeout=0;
  }

  server {
    listen 80 default;
    client_max_body_size 4G;
    server_name _;
    keepalive_timeout 5;

    location / {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_buffering off;
      proxy_pass http://app_server;
      break;
    }

    location /media/ {
      root /home/jan/static;
    }
  }

This will serve everything under /media/ from /home/jan/static (a symbolic link pointing to mathics/web/media), while all other URLs are passed to the app server which communicates with gunicorn via Unix sockets.

One way to deploy gunicorn is Circus, a process watcher and runner. It can be configured like the following:

[watcher:mathics]
cmd = sudo -u jan /home/jan/sage/sage -python /home/jan/sage/local/bin/gunicorn_django --bind unix:/tmp/gunicorn.sock --workers 4 /home/jan/mathics
working_dir = /home/jan/mathics
send_hup = true

This will execute gunicorn_django in the Mathics directory and listen to the corresponding Unix socket at startup. gunicorn will in turn fire up four workers.

I couldn’t figure out how to conveniently restart these Django worker processes. However, after simply killing them using

kill `ps -ef | grep "python /home/jan/sage/local/bin/gunicorn_django" | grep -v grep | awk '{print $2}'`

they will be restarted automatically, thereby loading any changes made to the Mathics source code.

Please let me know if you have any suggestions.

Alpha Maps: a mashup of Wolfram|Alpha and Google Maps

Hands down, Wolfram is awesome. Mathematica has been my favorite computer algebra system* from the start (okay, there were some wrong ways towards Maple in the very beginning, but hey, everybody makes mistakes); and Wolfram|Alpha is a big step (and still stepping) towards a huge semantic web.

Now I just came across the pretty new Wolfram|Alpha API and thought I had to do something with it. Having been a maps guy since the tripedia days, I had the idea of mashing up W|A with Google Maps.

A user can enter an arbitrary Wolfram|Alpha query (usually a city), the query is sent to W|A and the result is parsed on the client side. All items in the result that refer to geolocations are queried also to get a set of markers on the map. With nice info windows showing the data that W|A returned for them.

W|A sends quite a lot of information, but for now I was only interested in plain facts. These can be accessed using the plaintext field for every so-called “subpod”. Each of these plaintexts actually represents a table encoded by newlines and | separators. One could argue about the elegance of this representation in the API, but whatever.

The W|A API is a server-side-only API, as it doesn’t set the Access-Control-Allow-Origin HTTP header to *, so accessing it directly from any modern browser will violate the same-origin policy and thus fail. Therefore, a thin PHP wrapper on a server is needed. The good thing is that this wrapper can also cache queries to avoid reaching the free API limit of 2000 monthly calls too soon.

I don’t want to go into the details of the implementation here—it’s basically a bit of JavaScript using jQuery. If you’re interested, the code is online at github, and the resulting site is Alpha Maps.

It’s been a nice day of hacking. Any thoughts on how to turn this into something (even more) useful?

* I also like Sage, of course, and—surprise!—Mathics.

Converting text files to UTF-8

In a rather old project I’m working on again now, there used to be a lot of Latin-1-encoded files. Yuck! I don’t even want to know why anybody ever created or used a character encoding other than UTF-8. So I thought, let’s get these old-school files a decent encoding.

iconv can do the job:

iconv -f L1 -t UTF-8 filename >filename.converted

This will convert the file filename from Latin-1 to UTF-8 and save it as filename.converted.

To find all relevant files in the project directory, we use find, of course. The only issue with this is that a simple for x in `find ...` loop will not handle filenames containing spaces correctly, so we apply while read to it, as in:

find . -name '*.php' | while read x; #...

This will execute the rest with a variable x being assigned every PHP filename in the current directory. (There are other approaches to this as well, of course.)

Now there’s only one problem left to deal with: Some files in the directory are already UTF-8-encoded. Of course, we don’t want to re-encode them again. (Decoding from Latin-1 and encoding to UTF-8 is not idempotent for characters beyond ASCII.) There might be other solutions, but I decided to use Python and the chardet package to determine whether a file is already UTF-8-encoded:

import chardet
if chardet.detect(str)['encoding'].lower() == 'utf-8':
    print ('UTF-8')
else:
    print ('L1')

This will print UTF-8 if the string str is encoded in UTF-8 and L1 otherwise.

Adding some code to output the current file and to remove the original file and replace it by the converted one, we get the following script:

find . -name '*.php' | while read x; do
    e=$(python -c "import chardet; print ('UTF-8' if chardet.detect(file('$x').read())['encoding'].lower() == 'utf-8' else 'L1')")
    echo "converting $x: $e"
    iconv -f $e -t UTF-8 "$x" > "$x.utf8"
    rm "$x"
    mv "$x.utf8" "$x"
done

We can also assemble this into a bash one-liner if we prefer:

find . -name '*.php' | while read x; do e=$(python -c "import chardet; print ('UTF-8' if chardet.detect(file('$x').read())['encoding'].lower() == 'utf-8' else 'L1')"); echo "converting $x: $e"; iconv -f $e -t UTF-8 "$x" > "$x.utf8"; rm "$x"; mv "$x.utf8" "$x"; done

Shortest unused Twitter #hashtag

What is the shortest Twitter #hashtag that has never been used? As an additional constraint, let’s focus on the lexicographically first hashtag composed of ASCII letters only of that kind. Let’s skip dessert and use Python and the Twitter Search API to find out:

import urllib
import json
import itertools
import string
import time

k, max_k = 1, 10
while k < max_k:
    for tag in itertools.product(string.ascii_lowercase, repeat=k):
        tag = ''.join(tag)
        print "Searching for #%s" % tag
        search_url = 'https://search.twitter.com/search.json?q=%%23%s' % tag
        while True:
            search_result = json.loads(urllib.urlopen(search_url).read())
            if 'results' in search_result:
                break
            print "Wait a few seconds because of Twitter Search API limit"
            time.sleep(5)
        search_result = search_result['results']
        if not search_result:
            print "Unused hashtag: #%s" % tag
            k = max_k
            break
    k += 1

After a few minutes, the result: #agy. What could we use that one for?

Memory-efficient Django queries

When doing heavy computations involving the Django Object-Relational Mapper to access your database, you might notice Python consuming lots of memory. This will probably not happen during production web server mode, because dealing with lots of data to serve a single request already indicates that something is wrong and at least some preprocessing should be done. So that’s probably why Django isn’t tailored towards such needs, rather favoring speed (both at execution and coding time) at the cost of memory. But at least when you’re in that preprocessing phase or you’re just using the Django ORM for some scientific computation, you don’t want to consume more memory than absolutely necessary.

The first issue to watch out for is debug mode. From the docs:

It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you are debugging, but on a production server, it will rapidly consume memory.

“Remembering” means that Django stores all SQL queries and their execution times in a list django.db.connection.queries of the form [{'sql': 'SELECT ...', 'time': '0.01'}, ]. This is useful when you want to profile your queries (see the docs), but needless overhead when you’re just doing heavy-database-access computations. So either set DEBUG to False or clear this list regularly in that case.

Moreover, it is important to understand how Django holds data in memory. Although the resulting objects are constructed “lazily” on the fly, the resulting rows of querysets are kept in memory by default so that multiple iterations on the result can use these cached versions. This can be avoided by using queryset.iterator(). So while

entries = Entry.objects.all()
for entry in entries:
    print entry
for entry in entries:
    print entry

will receive all entries from the database once and keep them in memory,

for entry in entries.iterator():
    print entry
for entry in entries.iterator():
    print entry

will execute the query twice but save memory.

However, even using iterator() can still lead to a heavy memory footprint, not directly on Django’s side, but from the database interface (e.g. the Python MySQLdb module). It will receive and store all the resulting data from the database server before even handing bits over to Django. So the only way to avoid this is to use queries that don’t produce too much data at once. This snippet does exactly that:

def queryset_iterator(queryset, chunksize=1000):
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row

It basically receives slices (default size 1000) of the queryset ordered by the primary key. As a consequence, any ordering previously applied to the queryset is lost. The reason for this behavior is, of course, that ordering (and especially slicing) by primary key is fast. You can use the function like this:

for entry in queryset_iterator(Entry.objects):
    print entry

Apart from not being able to order the queryset, this approach does not handle concurrent modification of the data well: When rows are inserted or deleted while iterating over the dataset with queryset_iterator, rows might be reported several times or never. That should not be a big problem when preprocessing data though.

I modified queryset_iterator slightly to save the initial primary key query and to allow for non-numerical primary keys and also reverse ordering of primary keys:

def queryset_iterator(queryset, chunksize=1000, reverse=False):
    ordering = '-' if reverse else ''
    queryset = queryset.order_by(ordering + 'pk')
    last_pk = None
    new_items = True
    while new_items:
        new_items = False
        chunk = queryset
        if last_pk is not None:
            func = 'lt' if reverse else 'gt'
            chunk = chunk.filter(**{'pk__' + func: last_pk})
        chunk = chunk[:chunksize]
        row = None
        for row in chunk:
            yield row
        if row is not None:
            last_pk = row.pk
            new_items = True

To sum things up:

  • Set DEBUG = False when doing lots of database operations.
  • Use queryset_iterator when you deal with millions of rows and don’t care about ordering.
  • Still enjoy the convenience of Django!

Highlighting changes in LaTeX

I want to shortly point out how convenient it is in LaTeX to highlight changes you made to a document, let’s say for a journal resubmission. That is, you want to show your readers which parts you have added or deleted. In my opinion, coloring additions green, deletions red, and additionally marking edits with bars in the page margin is a good way of doing that.

That can be achieved pretty easily with two packages in LaTeX: xcolor and changebar.

\usepackage{xcolor}
\usepackage{changebar}

Using these, we can define the following commands to mark up additions and deletions:

\newcommand{\removed}[1]{\cbstart\removedfragile{#1}\cbend{}}
\newcommand{\removedfragile}[1]{{\color{red}{#1}}{}}
\newcommand{\added}[1]{\cbstart\addedfragile{#1}\cbend{}}
\newcommand{\addedfragile}[1]{{\color{green!50!black}{#1}}{}}
\newcommand{\changed}[2]{\added{#1}\removed{#2}}

At least from my experience, change bars don’t work in “fragile” environments such as float captions, that’s why there are versions of the commands that only color the edit. With this, we could already do

This \changed{new replacement}{text} is replaced.

to produce the above image. However, what if we want to see the final version only, without the change markup? Let’s define a new if

\newif\ifdiff
\difffalse

and then conditionally define our markup functions:

\ifdiff
  % the above definitions of removed* and added*
\else
  \newcommand{\removed}[1]{} % non-markup version
  \newcommand{\removedfragile}[1]{}
  \newcommand{\added}[1]{#1}
  \newcommand{\addedfragile}[1]{#1}
\fi

Now we can select whether we want to display the edits by setting diff to true or false. You and your reviewers will like it.

You can download the full source code of a sample page, the corresponding output with differences shown and without markup.

Extreme Geeky Tidying Up: Sorting image colors with Python and Hilbert curves

There are levels of tidiness.

Tidy.
Very Tidy.
Totally Deranged Tidy.

Geeky Tidy.

After sharing these awesome pictures by the Swiss “artist” Ursus Wehrli with the world, Martin N. pointed out that there is still a severe issue with the Badewiese: People are not even sorted by the colors of their swimsuits! Adding to that, trees are placed in total chaos, not to speak of their leaves; clearly, there are tiny waves on the water; and one can almost see blades of grass not being sorted by size. Obviously, this picture needed further tidying up.


© Ursus Wehrli

Python to the rescue! 5 minutes, 10 lines, and half a beer later I had a program that would load the image, re-arrange its pixels, and save it again:

from PIL import Image

old = Image.open("tidy.jpg")
colors = old.getcolors(old.size[0] * old.size[1])
data = []
for count, color in colors:
    data.extend(count * [color])
data.sort()
new = Image.new('RGB', old.size)
new.putdata(data)
new.save("naive_tidy.png")

This yields the following picture:

So how was this “re-arrangement” performed? In a naïve approach, I just sorted the RGB values of the pixels, which is why there are many consecutive lines (of equal reds) with abrupt changes. Ordered, but far from being tidy.

Let’s try another color space and sort by HSV values—a smooth hue should look better, after all. So by adding

import colorsys

and changing one line to

data.sort(key=lambda rgb: colorsys.rgb_to_hsv(*rgb))

we get the following image:

More beautiful, but the red and white noise in the end doesn’t look tidy at all. We want a smooth transition through all given colors. Hilbert curves to the rescue! Found a ready-to-be-used Hilbert walk implementation, added

import hilbert

and changed the sorting once again to

data.sort(key=hilbert.Hilbert_to_int)

to get this image:

Mathematical computer science just made the world a little tidier.

Death and reincarnation of some bitten fruit

Usually, when I turn on my computer, I don’t spend a second thinking about whether it will actually turn on. From now on I probably will.

I just wanted to resume from sleep (which is the usual mode of non-operation for it), but the screen remained black and nothing happened. That’s not totally unusual, as sometimes something in the computer decides it doesn’t want to wake up, in which case you have to turn it off and on again. That’s what I did, but in this unique case, it still remained black. The front light was on and their was some noise from the fan, but nothing more. No Mac startup sound, which I never particularly liked, but in this case would have loved to hear.

Letting it rest for a while, removing the battery, resetting the PRAM, switching RAM modules, nothing helped. Finally, I made an appointment with the Apple Genius Bar, conveniently located a few blocks away in the Stanford Shopping Center.

After a few experiments by my nice “genius,” he told me it would be $310 to have my notebook (a mid-2008 15″ MacBook Pro 2.4 GHz Intel Core 2 Duo with 4 GB RAM running Leopard, by the way) repaired. Not too cheap, but I would have paid anything to be able to work on my stuff again. Luckily though, the guy continued his research and finally told me the repair would be free—even though warranty had expired about two years ago!

Apparently, it was another case of this well-known graphics issue. Actually, I had occasionally noticed some of the described symptoms and was aware of this issue, I just had not thought that it could also cause the computer not to start up at all.

Even better, the guy told me I could probably get my computer back in one or two days—which would have been a Sunday, amazingly enough. And I would not have to worry about my data, the hard drive was safe.

As if that had not been good enough, they called me on the same day at 6.30 pm that my computer was ready for pickup! (I had been in the store at 10.45 am before.) I got there at around 8 pm (yes, my Austrian friends, they’re still open at that time, usually even on Sundays!) and got back my ready-to-rock shiny-new MacBook again. Apparently, they had even polished it.

One could argue that this problem shouldn’t have occurred in the first place, but still, well done, Apple, awesome support! Even if I had concerns about Apple’s policy recently (and not only recently), right now, I love you again.

Had some good steaks to celebrate this quick reincarnation.