Skip to main content

Ralsina.Me — Roberto Alsina's website

Unicode in Python is Fun!

As I hope you know, if you get a string of bytes, and want the text in it, and that text may be non-asci­i, what you need to do is de­code the string us­ing the cor­rect en­cod­ing name:

>>> 'á'.decode('utf8')

How­ev­er, there is a gotcha there. You have to be ab­so­lute­ly sure that the thing you are de­cod­ing is a string of bytes, and not a uni­code ob­jec­t. Be­cause uni­code ob­jects al­so have a de­code method­ but it's an in­cred­i­bly use­less one, whose on­ly pur­pose in life is caus­ing this pe­cu­liar er­ror:

>>> u'á'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
in position 0: ordinal not in range(128)

Why pe­cu­liar? Be­cause it's an En­code er­ror. Caused by call­ing de­code. You see, on uni­code ob­ject­s, de­code does some­thing like this:

def decode(self, encoding):
    return self.encode('ascii').decode(encoding)

The us­er wants a uni­code ob­jec­t. He has a uni­code ob­jec­t. By def­i­ni­tion, there is no such thing as a way to ut­f-8-de­code a uni­code ob­jec­t. It just ­makes NO SENSE. It's like ask­ing for a way to comb a fish, or climb a lake.

What it should return is self! Also, it's annoying as all hell in that the only way to avoid it is to check for type, which is totally unpythonic.

Or even bet­ter, let's just not have a de­code method on uni­code ob­ject­s, which I think is the case in python 3, and I know we will nev­er get on python 2.

So, be aware of it, and good luck!


Comments powered by Disqus