Converting text files to UTF-8

In a rather old project I’m working on again, there were a lot of Latin-1-encoded files. Yuck! I don’t even want to know why anybody ever created or used a character encoding other than UTF-8. So I thought, let’s give these old-school files a decent encoding.

iconv can do the job:

iconv -f L1 -t UTF-8 filename >filename.converted

This converts the file filename from Latin-1 to UTF-8 and saves the result as filename.converted. (L1 is one of the aliases GNU iconv accepts for ISO-8859-1, i.e. Latin-1.)
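For anyone curious what that conversion amounts to at the byte level, here is a minimal Python sketch of the same step. The function name convert_latin1_to_utf8 and the demo file are my own placeholders, not anything iconv provides.

```python
import os
import tempfile


def convert_latin1_to_utf8(src, dst):
    """Read src as Latin-1 and write the same text to dst as UTF-8."""
    with open(src, "rb") as f:
        raw = f.read()
    # Every byte value is a valid Latin-1 character, so this decode never fails.
    text = raw.decode("latin-1")
    with open(dst, "wb") as f:
        f.write(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical demo file: "café" in Latin-1, where é is the single byte 0xE9.
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "filename")
        dst = src + ".converted"
        with open(src, "wb") as f:
            f.write(b"caf\xe9\n")
        convert_latin1_to_utf8(src, dst)
        with open(dst, "rb") as f:
            print(f.read())  # b'caf\xc3\xa9\n'
```

The one Latin-1 byte for é becomes two bytes in UTF-8, while the ASCII characters stay exactly as they were.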

To find all relevant files in the project directory, we use find, of course. The only issue is that a simple for x in `find ...` loop splits find’s output on whitespace, so it mishandles filenames containing spaces. We therefore pipe the output into while read instead, which consumes one line at a time (IFS= and -r keep leading whitespace and backslashes in the names intact):

find . -name '*.php' | while IFS= read -r x; #...

This executes the rest of the loop once for every PHP file in the current directory and its subdirectories, with the variable x set to that file’s path. (There are other approaches to this as well, of course.)

Now there’s only one problem left to deal with: some files in the directory are already UTF-8-encoded, and of course we don’t want to re-encode them. (Decoding from Latin-1 and encoding to UTF-8 is not idempotent for characters beyond ASCII: a second pass turns every multi-byte UTF-8 sequence into several garbage characters.) There might be other solutions, but I decided to use Python and the chardet package to determine whether a file is already UTF-8-encoded:
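To make the non-idempotence concrete, here is a small sketch that runs the conversion twice on a single character. It uses only Python’s built-in codecs; the variable names are mine.

```python
# "é" is the single byte 0xE9 in Latin-1.
latin1_bytes = b"\xe9"

# First pass (correct): Latin-1 -> UTF-8 yields the two bytes 0xC3 0xA9.
once = latin1_bytes.decode("latin-1").encode("utf-8")

# Second pass (wrong): each UTF-8 byte is misread as its own Latin-1 character.
twice = once.decode("latin-1").encode("utf-8")

print(once)   # b'\xc3\xa9'
print(twice)  # b'\xc3\x83\xc2\xa9', which displays as "Ã©"

# Pure ASCII survives any number of passes unchanged.
print(b"abc".decode("latin-1").encode("utf-8"))  # b'abc'
```

That “Ã©” is the classic symptom of a file that was converted once too often.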

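import chardet
# data holds the raw bytes of the file, e.g. data = open(filename, 'rb').read()
# chardet reports None as the encoding when it cannot guess (e.g. for an empty file).
encoding = chardet.detect(data)['encoding'] or ''
if encoding.lower() == 'utf-8':
    print('UTF-8')
else:
    print('L1')

This prints UTF-8 if chardet guesses that the bytes in data are UTF-8-encoded and L1 otherwise. Pure-ASCII files also end up as L1, which is harmless because ASCII bytes are identical in Latin-1 and UTF-8.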
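One of those “other solutions” needs no third-party package. UTF-8 has a strict byte structure, so Latin-1 text with accented characters will usually fail a strict UTF-8 decode. The sketch below swaps chardet for that check. It is a different technique from the one used in this post, and is_utf8 is a name I made up. It can still be fooled by the rare Latin-1 text whose bytes happen to form valid UTF-8.

```python
def is_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8 (pure ASCII counts too)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8(b"caf\xc3\xa9"))  # True: é as the UTF-8 pair 0xC3 0xA9
print(is_utf8(b"caf\xe9"))      # False: a lone 0xE9 is not valid UTF-8
print(is_utf8(b"plain ascii"))  # True: ASCII is a subset of UTF-8
```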

Adding some code to print the current file and to replace the original file with the converted one, we get the following script:

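find . -name '*.php' | while IFS= read -r x; do
    # The filename is passed as an argument so that quotes in it cannot break the Python code;
    # open(..., 'rb') replaces the Python 2-only file().
    e=$(python3 -c "import sys, chardet; print('UTF-8' if (chardet.detect(open(sys.argv[1], 'rb').read())['encoding'] or '').lower() == 'utf-8' else 'L1')" "$x")
    echo "converting $x: $e"
    # Overwrite the original only if iconv succeeded.
    iconv -f "$e" -t UTF-8 "$x" > "$x.utf8" && mv "$x.utf8" "$x"
done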

We can also assemble this into a bash one-liner if we prefer:

find . -name '*.php' | while IFS= read -r x; do e=$(python3 -c "import sys, chardet; print('UTF-8' if (chardet.detect(open(sys.argv[1], 'rb').read())['encoding'] or '').lower() == 'utf-8' else 'L1')" "$x"); echo "converting $x: $e"; iconv -f "$e" -t UTF-8 "$x" > "$x.utf8" && mv "$x.utf8" "$x"; done
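If you would rather stay in one language, the whole loop can also be sketched in plain Python. This is only an illustration under a few assumptions: convert_tree is a hypothetical name, the detection is a strict UTF-8 decode instead of chardet, and anything that is not valid UTF-8 is assumed to be Latin-1.

```python
import os


def convert_tree(root, suffix=".php"):
    """Re-encode Latin-1 files under root as UTF-8; return the paths that changed."""
    converted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                raw = f.read()
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                # Not valid UTF-8: assume Latin-1 (an assumption, not a detection).
                text = raw.decode("latin-1")
            else:
                continue  # already valid UTF-8 (or pure ASCII): leave it alone
            tmp = path + ".utf8"
            with open(tmp, "wb") as f:
                f.write(text.encode("utf-8"))
            os.replace(tmp, path)  # swap the converted file in for the original
            converted.append(path)
    return converted
```

Calling convert_tree('.') from the project directory would do roughly what the shell loop does. Filenames with spaces or quotes need no special care here, because os.walk hands them over as ordinary strings.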
