Shortest unused Twitter #hashtag

What is the shortest Twitter #hashtag that has never been used? As an additional constraint, among the shortest unused hashtags composed only of ASCII letters, let’s look for the lexicographically first one. Let’s skip dessert and use Python and the Twitter Search API to find out:

import urllib
import json
import itertools
import string
import time

k, max_k = 1, 10
while k < max_k:
    for tag in itertools.product(string.ascii_lowercase, repeat=k):
        tag = ''.join(tag)
        print "Searching for #%s" % tag
        search_url = 'https://search.twitter.com/search.json?q=%%23%s' % tag
        while True:
            search_result = json.loads(urllib.urlopen(search_url).read())
            if 'results' in search_result:
                break
            print "Wait a few seconds because of Twitter Search API limit"
            time.sleep(5)
        search_result = search_result['results']
        if not search_result:
            print "Unused hashtag: #%s" % tag
            k = max_k
            break
    k += 1

After a few minutes, the result: #agy. What could we use that one for?
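The search order used here (all tags of length 1, then length 2, and so on, each length in alphabetical order) can be separated from the network call. The following is a minimal Python 3 sketch of just that enumeration; `first_unused` and the `is_used` predicate are hypothetical names that do not appear in the script above, and the predicate merely stands in for the HTTP lookup.

```python
import itertools
import string


def first_unused(is_used, max_len=10, alphabet=string.ascii_lowercase):
    """Return the shortest, lexicographically first tag for which
    is_used(tag) is false, or None if every tag up to max_len is used."""
    for length in range(1, max_len + 1):
        # itertools.product yields tuples in lexicographic order.
        for letters in itertools.product(alphabet, repeat=length):
            tag = ''.join(letters)
            if not is_used(tag):
                return tag
    return None


# Toy stand-in for the lookup: pretend all one-letter tags and "aa" are taken.
taken = set(string.ascii_lowercase) | {'aa'}
print(first_unused(lambda tag: tag in taken))  # prints "ab"
```

Because the function returns as soon as it finds a free tag, no flag juggling with the loop counter is needed to leave both loops.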

Memory-efficient Django queries

When doing heavy computations that use the Django Object-Relational Mapper to access your database, you might notice Python consuming lots of memory. This will probably not happen in production when serving web requests, because needing lots of data to serve a single request already indicates that something is wrong and that at least some preprocessing should be done. That’s probably why Django isn’t tailored towards such needs and instead favors speed (both at execution and coding time) at the cost of memory. But when you’re in that preprocessing phase, or you’re just using the Django ORM for some scientific computation, you don’t want to consume more memory than absolutely necessary.

The first issue to watch out for is debug mode. From the docs:

It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you are debugging, but on a production server, it will rapidly consume memory.

“Remembering” means that Django stores all SQL queries and their execution times in a list django.db.connection.queries of the form [{'sql': 'SELECT ...', 'time': '0.01'}, ]. This is useful when you want to profile your queries (see the docs), but needless overhead when you’re just doing database-heavy computations. So in that case, either set DEBUG to False or clear this list regularly, for example by calling django.db.reset_queries().

Moreover, it is important to understand how Django holds data in memory. Although the resulting objects are constructed “lazily” on the fly, the resulting rows of querysets are kept in memory by default so that multiple iterations on the result can use these cached versions. This can be avoided by using queryset.iterator(). So while

entries = Entry.objects.all()
for entry in entries:
    print entry
for entry in entries:
    print entry

will retrieve all entries from the database once and keep them in memory,

for entry in entries.iterator():
    print entry
for entry in entries.iterator():
    print entry

will execute the query twice but save memory.

However, even using iterator() can still lead to a heavy memory footprint, not directly on Django’s side, but from the database interface (e.g. the Python MySQLdb module). It will receive and store all the resulting data from the database server before even handing bits over to Django. So the only way to avoid this is to use queries that don’t produce too much data at once. This snippet does exactly that:

def queryset_iterator(queryset, chunksize=1000):
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row

It fetches slices (default size 1000) of the queryset ordered by the primary key. As a consequence, any ordering previously applied to the queryset is lost. The reason for this behavior is, of course, that ordering (and especially slicing) by primary key is fast. You can use the function like this:

for entry in queryset_iterator(Entry.objects):
    print entry

Apart from not being able to order the queryset, this approach does not handle concurrent modification of the data well: When rows are inserted or deleted while iterating over the dataset with queryset_iterator, rows might be reported several times or never. That should not be a big problem when preprocessing data though.

I modified queryset_iterator slightly to save the initial primary key query and to allow for non-numerical primary keys and also reverse ordering of primary keys:

def queryset_iterator(queryset, chunksize=1000, reverse=False):
    # Walk the queryset in primary-key order, one chunk per database query.
    ordering = '-' if reverse else ''
    queryset = queryset.order_by(ordering + 'pk')
    last_pk = None
    new_items = True
    while new_items:
        new_items = False
        chunk = queryset
        if last_pk is not None:
            # Continue after the last primary key seen in the previous chunk.
            func = 'lt' if reverse else 'gt'
            chunk = chunk.filter(**{'pk__' + func: last_pk})
        chunk = chunk[:chunksize]
        row = None
        for row in chunk:
            yield row
        if row is not None:
            # The chunk was non-empty, so there may be more rows to fetch.
            last_pk = row.pk
            new_items = True
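The same chunking idea can be tried without a Django project. The following is a minimal Python 3 sketch using the standard library’s sqlite3 module; the table `entry`, its columns, and the function name `iter_rows_by_pk` are made up for illustration, and the SQL is only an approximation of what an ORM would issue for such filtered, sliced queries.

```python
import sqlite3


def iter_rows_by_pk(conn, chunksize=1000):
    """Yield (id, title) rows of the hypothetical table `entry`,
    fetching at most `chunksize` rows per query, ordered by id."""
    last_id = None
    while True:
        if last_id is None:
            cursor = conn.execute(
                "SELECT id, title FROM entry ORDER BY id LIMIT ?",
                (chunksize,))
        else:
            # Resume right after the last primary key already yielded.
            cursor = conn.execute(
                "SELECT id, title FROM entry WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, chunksize))
        rows = cursor.fetchall()  # only one chunk is held in memory
        if not rows:
            return
        for row in rows:
            yield row
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO entry (id, title) VALUES (?, ?)",
                 [(i, "entry %d" % i) for i in range(1, 2501)])

ids = [row[0] for row in iter_rows_by_pk(conn, chunksize=1000)]
print(len(ids), ids[0], ids[-1])  # prints "2500 1 2500"
```

Filtering on the last seen key instead of using an ever-growing OFFSET keeps every chunk query cheap, which is the reason for ordering by primary key in the first place.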

To sum things up:

  • Set DEBUG = False when doing lots of database operations.
  • Use queryset_iterator when you deal with millions of rows and don’t care about ordering.
  • Still enjoy the convenience of Django!

Highlighting changes in LaTeX

I want to briefly point out how convenient it is in LaTeX to highlight changes you made to a document, let’s say for a journal resubmission. That is, you want to show your readers which parts you have added or deleted. In my opinion, coloring additions green, deletions red, and additionally marking edits with bars in the page margin is a good way of doing that.

That can be achieved pretty easily with two packages in LaTeX: xcolor and changebar.

\usepackage{xcolor}
\usepackage{changebar}

Using these, we can define the following commands to mark up additions and deletions:

\newcommand{\removed}[1]{\cbstart\removedfragile{#1}\cbend{}}
\newcommand{\removedfragile}[1]{{\color{red}{#1}}{}}
\newcommand{\added}[1]{\cbstart\addedfragile{#1}\cbend{}}
\newcommand{\addedfragile}[1]{{\color{green!50!black}{#1}}{}}
\newcommand{\changed}[2]{\added{#1}\removed{#2}}

At least in my experience, change bars don’t work in “fragile” environments such as float captions, which is why there are versions of the commands that only color the edit. With this, we could already do

This \changed{new replacement}{text} is replaced.

to produce the above image. However, what if we want to see the final version only, without the change markup? Let’s define a new if

\newif\ifdiff
\difffalse

and then conditionally define our markup functions:

\ifdiff
  % the above definitions of removed* and added*
\else
  \newcommand{\removed}[1]{} % non-markup version
  \newcommand{\removedfragile}[1]{}
  \newcommand{\added}[1]{#1}
  \newcommand{\addedfragile}[1]{#1}
\fi

Now we can select whether we want to display the edits by replacing \difffalse with \difftrue or vice versa. You and your reviewers will like it.

You can download the full source code of a sample page, the corresponding output with differences shown and without markup.

The last month

Time accelerates exponentially, it seems. I had three more weeks on campus before my sublease ended and I moved to Rick’s place in Sonoma for another week, and it all went by so fast. I was mostly working on our JBI paper, but also enjoyed the sun and more steaks.

Eventually, people arrived on campus. Had some very funny evenings with a couple of new grad students, introduced them to the Austrian way of Åviechan. :) Thanks for your couch, Jonny!

On my last weekend, I went twice to the Hardly Strictly Bluegrass Festival, a free three-day festival in Golden Gate Park in the heart of San Francisco. So many funny, crazy people there! It seems that for some Californians, the 60s never ended. :) I particularly enjoyed the performance of Buckethead on Saturday, a rather non-bluegrassy mixture of jazz and hard rock—and yes, he really wears a bucket on his head while playing his various guitars.

I sold my bike—will miss you so much, dear blue beach cruiser! And when I left on Monday, something happened that hasn’t happened for a few months: It started raining.

Now all that’s left to say is, it’s been an awesome summer, goodbye, thanks for reading, and hope to see you soon! I’ll be back.

L.A. and San Diego

Fourth of September was Labor Day, so we (Pablo, Katja, Rossana, me) decided to use the long weekend and go to Los Angeles and San Diego. No trains in the U.S., so we took a flight to Long Beach and rented a car there. Spent some time at Venice Beach, where a lot of funny people do all sorts of funny things. Cruised around in Beverly Hills to finally reach our domicile for the night, the Budget Inn. Its name says pretty much everything about how it looked from outside. The room was okay though.

We didn’t stay long there, anyway. L.A. nightlife, here we come! I still don’t know whether the cab driver brought us to the club we actually wanted to go to, but it seemed he knew a lot about partying, so we trusted him. The music there was better than I had expected. At one point, they forced people off the stage and did some photo and video shooting. No idea what exactly was going on or who the fuss was all about, but it must’ve been some sort of celebrity.

Oh, did I say “nightlife”? In fact, all bars, clubs, etc. have to close at 2:00, so there’s no risk of having fun the whole night. Supermarkets and copy shops are still open at that time though.

After a yummy and way too big breakfast on Sunday, we made a short stop in Hollywood and continued driving to San Diego. This might seem surprising, but the Bristol was a little nicer than the Budget Inn.

Is it a sign of getting older when you shop at Banana Republic (what a great name!) while others go to Forever 21? I hope not. At least I found a nice shirt there.

We went to Balboa Park and Coronado Island. Had the first few minutes of rain since I don’t know when. Nevertheless, had a great, relaxed time.

On Tuesday, we slowly made our way back to Los Angeles Airport, stopping at a few beaches to enjoy the sun. Eventually we returned the car and flew back. Nice trip.

Extreme Geeky Tidying Up: Sorting image colors with Python and Hilbert curves

There are levels of tidiness.

Tidy.
Very Tidy.
Totally Deranged Tidy.

Geeky Tidy.

After sharing these awesome pictures by the Swiss “artist” Ursus Wehrli with the world, Martin N. pointed out that there is still a severe issue with the Badewiese: People are not even sorted by the colors of their swimsuits! Adding to that, trees are placed in total chaos, not to speak of their leaves; clearly, there are tiny waves on the water; and one can almost see blades of grass not being sorted by size. Obviously, this picture needed further tidying up.


© Ursus Wehrli

Python to the rescue! 5 minutes, 10 lines, and half a beer later I had a program that would load the image, re-arrange its pixels, and save it again:

from PIL import Image

old = Image.open("tidy.jpg")
colors = old.getcolors(old.size[0] * old.size[1])
data = []
for count, color in colors:
    data.extend(count * [color])
data.sort()
new = Image.new('RGB', old.size)
new.putdata(data)
new.save("naive_tidy.png")

This yields the following picture:

So how was this “re-arrangement” performed? In a naïve approach, I just sorted the RGB values of the pixels, which is why there are many consecutive lines (of equal reds) with abrupt changes. Ordered, but far from being tidy.
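The effect is easy to reproduce on a handful of values. The following is a small Python 3 illustration with made-up pixel values (they are not taken from the picture): tuples are compared element by element, so the red channel dominates the order, and green and blue only break ties.

```python
# Python compares tuples element by element, so red decides almost everything.
pixels = [(200, 10, 10), (10, 250, 250), (11, 0, 0), (10, 0, 255), (200, 10, 9)]
ordered = sorted(pixels)
print(ordered)
# [(10, 0, 255), (10, 250, 250), (11, 0, 0), (200, 10, 9), (200, 10, 10)]
```

The bright cyan (10, 250, 250) ends up directly before the almost black (11, 0, 0), just because their red values differ by one. Jumps like this are what show up as abrupt changes in the sorted image.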

Let’s try another color space and sort by HSV values—a smooth hue should look better, after all. So by adding

import colorsys

and changing one line to

data.sort(key=lambda rgb: colorsys.rgb_to_hsv(*[c / 255.0 for c in rgb]))

we get the following image:

More beautiful, but the red and white noise at the end doesn’t look tidy at all. We want a smooth transition through all given colors. Hilbert curves to the rescue! Found a ready-to-be-used Hilbert walk implementation, added

import hilbert

and changed the sorting once again to

data.sort(key=hilbert.Hilbert_to_int)

to get this image:

Mathematical computer science just made the world a little tidier.
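Why does a Hilbert walk give smooth transitions? Points that are neighbors along the curve are also neighbors in space. The following Python 3 sketch shows this in two dimensions only, using the well-known bit-twiddling conversion from grid coordinates to curve position; it is not the three-dimensional implementation used for the RGB values above, and the name `hilbert_index` is chosen just for this illustration.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a 2D Hilbert curve that fills an
    n-by-n grid, where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


n = 8
cells = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda cell: hilbert_index(n, *cell))
# Manhattan distance between consecutive cells along the curve.
steps = [abs(ax - bx) + abs(ay - by)
         for (ax, ay), (bx, by) in zip(cells, cells[1:])]
print(set(steps))  # prints "{1}"
```

Every step along the curve moves to a directly adjacent cell. Sorting colors by their position on a three-dimensional curve of this kind therefore puts similar colors next to each other.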

To the North

After renting an awesome car, a black Hyundai Santa Fe that would be big enough to let us sleep in it, Johanna and I continued our trip to the north. Again, we spent one night with one of Rick’s friends. This time it was Betsy in Chico, where we had actually already been two years ago. Still remember the huge steaks we got from the huge grill back then.

After that, we spent one night in Lassen Volcanic National Park. After a winter with over seven meters of snow in some places, they still have quite a lot of snow in the middle of summer! We hiked up above 3000 meters towards Lassen Peak, and we saw Bumpass Hell, a stinking but beautiful collection of mud pots and volcanic hot pools.

We continued our trip to the nice campground in McArthur-Burney Falls Memorial State Park and finally reached Lava Beds National Monument. Ages ago, huge streams of lava had created tube caves, when their top cooled and crusted while the inner part kept flowing.

Our companions on the road, among others: Johnny Cash (US), Pink Floyd, Muse (UK), Kruder & Dorfmeister (AT). And the Lowrider, of course.

Endless roads through forests. Desert. Lakes. Campfires. Togetherness. Happiness! :)