Converting text files to UTF-8

In a rather old project I’m working on again now, there used to be a lot of Latin-1-encoded files. Yuck! I don’t even want to know why anybody ever created or used a character encoding other than UTF-8. So I thought, let’s get these old-school files a decent encoding.

iconv can do the job:

iconv -f L1 -t UTF-8 filename >filename.converted

This will convert the file filename from Latin-1 to UTF-8 and save it as filename.converted.
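For illustration, here is a rough Python equivalent of that iconv call (just a sketch; iconv does the actual work in this post):

# Read the raw bytes, decode them as Latin-1, and write them back as UTF-8
# (roughly what the iconv invocation above does).
with open('filename', 'rb') as infile:
    text = infile.read().decode('latin-1')
with open('filename.converted', 'wb') as outfile:
    outfile.write(text.encode('utf-8'))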

To find all relevant files in the project directory, we use find, of course. The only issue is that a simple for x in `find ...` loop will not handle filenames containing spaces correctly, so we pipe the output of find into a while read loop instead, as in:

find . -name '*.php' | while read -r x; do ...; done

This runs the loop body (the ... part) once for every PHP file below the current directory, with the filename assigned to the variable x. (There are other approaches to this as well, of course.)
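One such alternative, if you prefer to stay entirely in Python, is to walk the directory tree with os.walk, which sidesteps the whitespace issue altogether (just a sketch, not used in the script below):

import os

# Collect all PHP files below the current directory; os.walk hands the
# names over as proper strings, so spaces in filenames are no problem.
php_files = [os.path.join(root, name)
             for root, dirs, files in os.walk('.')
             for name in files
             if name.endswith('.php')]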

Now there’s only one problem left to deal with: some files in the directory are already UTF-8-encoded, and of course we don’t want to re-encode those. (Decoding from Latin-1 and encoding to UTF-8 is not idempotent for characters beyond ASCII.) There might be other solutions, but I decided to use Python and the chardet package to determine whether a file is already UTF-8-encoded:

import chardet

# data holds the raw file contents as a byte string
encoding = chardet.detect(data)['encoding'] or ''
if encoding.lower() == 'utf-8':
    print('UTF-8')
else:
    print('L1')

This prints UTF-8 if the byte string data is UTF-8-encoded and L1 otherwise (falling back to L1 whenever chardet cannot detect any encoding at all).
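As a quick illustration of why re-encoding a file that is already UTF-8-encoded would be harmful:

# 'ä' (U+00E4) stored as UTF-8 is the byte sequence C3 A4; re-interpreting
# those bytes as Latin-1 and encoding them to UTF-8 again yields mojibake ('Ã¤').
original = u'\xe4'.encode('utf-8')                    # b'\xc3\xa4'
mangled = original.decode('latin-1').encode('utf-8')  # b'\xc3\x83\xc2\xa4'
print(original == mangled)                            # False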

Adding some code to output the current file and to remove the original file and replace it by the converted one, we get the following script:

find . -name '*.php' | while read -r x; do
    e=$(python -c "import chardet; enc = chardet.detect(open('$x', 'rb').read())['encoding'] or ''; print('UTF-8' if enc.lower() == 'utf-8' else 'L1')")
    echo "converting $x: $e"
    iconv -f "$e" -t UTF-8 "$x" > "$x.utf8"
    rm "$x"
    mv "$x.utf8" "$x"
done

We can also assemble this into a bash one-liner if we prefer:

find . -name '*.php' | while read -r x; do e=$(python -c "import chardet; enc = chardet.detect(open('$x', 'rb').read())['encoding'] or ''; print('UTF-8' if enc.lower() == 'utf-8' else 'L1')"); echo "converting $x: $e"; iconv -f "$e" -t UTF-8 "$x" > "$x.utf8"; rm "$x"; mv "$x.utf8" "$x"; done

Shortest unused Twitter #hashtag

What is the shortest Twitter #hashtag that has never been used? As an additional constraint, let’s look for the lexicographically first such hashtag consisting of ASCII letters only. Let’s skip dessert and use Python and the Twitter Search API to find out:

import urllib
import json
import itertools
import string
import time

k, max_k = 1, 10
while k < max_k:
    for tag in itertools.product(string.ascii_lowercase, repeat=k):
        tag = ''.join(tag)
        print "Searching for #%s" % tag
        search_url = 'https://search.twitter.com/search.json?q=%%23%s' % tag
        while True:
            search_result = json.loads(urllib.urlopen(search_url).read())
            if 'results' in search_result:
                break
            print "Wait a few seconds because of Twitter Search API limit"
            time.sleep(5)
        search_result = search_result['results']
        if not search_result:
            print "Unused hashtag: #%s" % tag
            k = max_k
            break
    k += 1

After a few minutes, the result: #agy. What could we use that one for?

Memory-efficient Django queries

When doing heavy computations that involve the Django Object-Relational Mapper to access your database, you might notice Python consuming lots of memory. This will probably not happen on a production web server, because having to process lots of data for a single request already indicates that something is wrong and that at least some preprocessing should be done beforehand. That is probably why Django isn’t tailored towards such needs and instead favors speed (both at execution and at coding time) at the cost of memory. But when you are in that preprocessing phase, or you are simply using the Django ORM for some scientific computation, you don’t want to consume more memory than absolutely necessary.

The first issue to watch out for is debug mode. From the docs:

It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you are debugging, but on a production server, it will rapidly consume memory.

“Remembering” means that Django stores all SQL queries and their execution times in a list django.db.connection.queries of the form [{'sql': 'SELECT ...', 'time': '0.01'}, ...]. This is useful when you want to profile your queries (see the docs), but it is needless overhead when you’re just doing heavy-database-access computations. So in that case, either set DEBUG to False or clear this list regularly.
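For example, in a long-running batch job you could clear the list every few thousand iterations. A sketch (process() here stands for whatever your computation does, and Entry is the model used in the examples below):

from django.db import reset_queries

for i, entry in enumerate(Entry.objects.all()):
    process(entry)          # your actual computation
    if i % 1000 == 0:
        reset_queries()     # drop the SQL statements logged so far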

Moreover, it is important to understand how Django holds data in memory. Although the model objects are constructed “lazily” on the fly, the rows resulting from a queryset are cached in memory by default, so that further iterations over the result can reuse them. This can be avoided by using queryset.iterator(). So while

entries = Entry.objects.all()
for entry in entries:
    print entry
for entry in entries:
    print entry

will receive all entries from the database once and keep them in memory,

for entry in entries.iterator():
    print entry
for entry in entries.iterator():
    print entry

will execute the query twice but save memory.

However, even using iterator() can still lead to a heavy memory footprint, not directly on Django’s side, but from the database interface (e.g. the Python MySQLdb module). It will receive and store all the resulting data from the database server before even handing bits over to Django. So the only way to avoid this is to use queries that don’t produce too much data at once. This snippet does exactly that:

def queryset_iterator(queryset, chunksize=1000):
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row

It basically fetches slices (default size 1000) of the queryset ordered by the primary key. As a consequence, any ordering previously applied to the queryset is lost. The reason for this design is, of course, that ordering (and especially slicing) by primary key is fast. You can use the function like this:

for entry in queryset_iterator(Entry.objects):
    print entry

Apart from losing the ability to order the queryset, this approach does not handle concurrent modification of the data well: when rows are inserted or deleted while iterating over the dataset with queryset_iterator, rows might be yielded several times or not at all. That should not be a big problem when preprocessing data, though.

I modified queryset_iterator slightly to avoid the initial query for the last primary key, to allow for non-numerical primary keys, and to support reverse ordering by primary key:

def queryset_iterator(queryset, chunksize=1000, reverse=False):
    ordering = '-' if reverse else ''
    queryset = queryset.order_by(ordering + 'pk')
    last_pk = None
    new_items = True
    while new_items:
        new_items = False
        chunk = queryset
        if last_pk is not None:
            func = 'lt' if reverse else 'gt'
            chunk = chunk.filter(**{'pk__' + func: last_pk})
        chunk = chunk[:chunksize]
        row = None
        for row in chunk:
            yield row
        if row is not None:
            last_pk = row.pk
            new_items = True
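
Usage is the same as before; for example, to walk over the entries from the highest primary key downwards in chunks of 500:

for entry in queryset_iterator(Entry.objects.all(), chunksize=500, reverse=True):
    print entry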

To sum things up:

  • Set DEBUG = False when doing lots of database operations.
  • Use queryset_iterator when you deal with millions of rows and don’t care about ordering.
  • Still enjoy the convenience of Django!

Extreme Geeky Tidying Up: Sorting image colors with Python and Hilbert curves

There are levels of tidiness.

Tidy.
Very Tidy.
Totally Deranged Tidy.

Geeky Tidy.

After I shared these awesome pictures by the Swiss “artist” Ursus Wehrli with the world, Martin N. pointed out that there is still a severe issue with the Badewiese (bathing lawn): people are not even sorted by the colors of their swimsuits! On top of that, the trees are placed in total chaos, not to speak of their leaves; there are clearly tiny waves on the water; and one can almost make out blades of grass not sorted by size. Obviously, this picture needed further tidying up.


© Ursus Wehrli

Python to the rescue! 5 minutes, 10 lines, and half a beer later I had a program that would load the image, re-arrange its pixels, and save it again:

from PIL import Image

old = Image.open("tidy.jpg")
colors = old.getcolors(old.size[0] * old.size[1])
data = []
for count, color in colors:
    data.extend(count * [color])
data.sort()
new = Image.new('RGB', old.size)
new.putdata(data)
new.save("naive_tidy.png")

This yields the following picture:

So how was this “re-arrangement” performed? In a naïve approach, I just sorted the RGB tuples of the pixels, which is why there are many consecutive runs of equal reds with abrupt changes between them. Ordered, but far from being tidy.
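
The reason is that Python compares tuples element by element, so data.sort() orders the pixels by their red component first, then green, then blue:

# Red dominates the ordering; green and blue only break ties.
pixels = [(254, 255, 255), (0, 255, 0), (0, 0, 255)]
print(sorted(pixels))   # [(0, 0, 255), (0, 255, 0), (254, 255, 255)]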

Let’s try another color space and sort by HSV values—a smooth hue should look better, after all. So by adding

import colorsys

and changing one line to

data.sort(key=lambda rgb: colorsys.rgb_to_hsv(*rgb))

we get the following image:

More beautiful, but the red-and-white noise at the end doesn’t look tidy at all. We want a smooth transition through all given colors. Hilbert curves to the rescue! I found a ready-to-use Hilbert walk implementation, added

import hilbert

and changed the sorting once again to

data.sort(key=hilbert.Hilbert_to_int)

to get this image:

Mathematical computer science just made the world a little tidier.

Batch-converting CSV to Google Maps KML to illustrate Slovenian name signs in Carinthia

Inspired by a tweet by DaddyD (which is totally true: the “graphic” by Kleine Zeitung is just a failed table), I figured there should be an easy way of generating maps from a set of places. In this case, to illustrate the places in Carinthia affected by the new agreements concerning Slovenian name signs. Google Maps to the rescue! It allows you to create custom maps with markers (and lines, polygons, etc.) at arbitrary positions—the problem is just that it’s pretty complicated to do that by hand for a bunch of places.

Being a geek, I wanted to automate that, of course. The workflow would be as follows:

  1. create a spreadsheet with columns for the place name, additional text (in this case, the Slovenian place name), additional information for the Google geocoder (region etc.), and the desired marker color, using OpenOffice or whatever,
  2. export that spreadsheet to CSV,
  3. have a Python script batch-geocode all the places and put corresponding markers in a KML file,
  4. import the KML file to Google Maps.

You can download the UTF-8-encoded CSV file for the Slovenian place names.
However, the interesting part is the Python script, of course:

from geopy import geocoders
import csv

geocoder = geocoders.Google()
data = csv.reader(open('data.csv', 'r'), delimiter='\t', quotechar='"')
kml = []
print "Geocoding places"
for line in data:
	place, place_extra, sub_region, region, country, color = [value.decode('utf-8') for value in line]
	if not color:
		color = 'blue'
	query = place
	if sub_region: query += u", " + sub_region
	if region: query += u", " + region
	if country: query += u", " + country
	try:
		name, (lat, lng) = geocoder.geocode(query.encode('utf-8'), exactly_one=True)
		kml.append((u"<Placemark><name>%s/%s</name>" + \
			u"<Style><IconStyle><Icon><href>http://www.google.com/intl/en_us/mapfiles/ms/icons/%s-dot.png</href></Icon></IconStyle></Style>" + \
			u"<Point><coordinates>%s,%s</coordinates></Point></Placemark>") % \
			(place, place_extra, color, lng, lat))
	except ValueError:
		print u"Place not found: %s/%s" % (place, place_extra)
print "Writing data.kml"
kml = u"\n".join(kml)
kml = u"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>Exported KML data</name>
%s
</Document>
</kml>
""" % kml
out = open('data.kml', 'w')
out.write(kml.encode('utf-8'))
out.close()
print "Done"

You can also download the whole Python script. Given the file data.csv, it looks up all the place names using the geopy Google geocoder and exports the places to a KML file data.kml. You can download the resulting KML file.

The KML file can then be easily imported into a custom Google Maps map, which results in this map:

View Kärntner Ortstafeln on a larger map

Blue markers denote places with Slovenian names according to the old regulation and new VfGH decisions since 2001; red markers represent places with a Slovenian population of more than 17.5%; and yellow markers are considered “room for negotiation”.

The general CSV-to-KML converter could be used for many more beautiful maps, of course.

UPDATE: Made some manual corrections to the markers, thanks to comments on Twitter and derStandard.at.