Unicode in Python is Fun!

2012-03-30 13:58

As I hope you know, if you get a string of bytes and want the text in it, and that text may be non-ASCII, you need to decode the string using the correct encoding name:

>>> 'á'.decode('utf8')
u'\xe1'
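
The "correct encoding name" part matters: the same bytes decode without complaint under the wrong codec and simply produce different text. A small sketch, written in Python 3 syntax rather than the Python 2 used in this post; the byte values are the UTF-8 encoding of á, and the variable names are only for illustration:

```python
raw = b'\xc3\xa1'              # the UTF-8 bytes for the letter a-acute
right = raw.decode('utf-8')    # one character, U+00E1
wrong = raw.decode('latin-1')  # two characters, U+00C3 and U+00A1 (mojibake)
```

No exception is raised in the second case, so a wrong encoding name can go unnoticed until someone reads the output.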

However, there is a gotcha. You have to be absolutely sure that the thing you are decoding is a string of bytes and not a unicode object, because unicode objects also have a decode method. It's an incredibly useless one, whose only purpose in life is causing this peculiar error:

>>> u'á'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Why peculiar? Because it's an encode error, caused by calling decode. You see, on unicode objects, decode first encodes the text to bytes with the default ASCII codec and only then decodes it, something like this:

def decode(self, encoding):
    # The implicit ASCII encode is the step that raises UnicodeEncodeError.
    return self.encode('ascii').decode(encoding)
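
To see where the encode error comes from, the two steps can be spelled out by hand. This is a sketch in Python 3 syntax, where str is the unicode type. The names py2_style_decode and error_name are made up for illustration, and the code only imitates the behavior described above; it is not CPython's actual implementation.

```python
def py2_style_decode(text, encoding):
    """Imitate Python 2's unicode.decode: ASCII-encode first, then decode."""
    # Step 1: the implicit conversion to bytes. This fails for any
    # character outside the ASCII range.
    raw = text.encode('ascii')
    # Step 2: the decode the caller actually asked for.
    return raw.decode(encoding)


def error_name(text, encoding='utf8'):
    """Return the name of the exception raised, or None on success."""
    try:
        py2_style_decode(text, encoding)
    except UnicodeError as exc:
        return type(exc).__name__
    return None
```

Under these assumptions, error_name('\xe1') reports an encode error even though the caller asked for a decode, while pure-ASCII text goes through both steps untouched.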

The user wants a unicode object. They already have a unicode object. By definition, there is no such thing as a way to UTF-8-decode a unicode object. It just makes NO SENSE. It's like asking for a way to comb a fish, or climb a lake.

What it should return is self! Also, it's annoying as all hell that the only way to avoid the error is to check the type before decoding, which is totally unpythonic.
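
For what it's worth, the type check can at least be hidden in one small helper. A minimal sketch, assuming UTF-8 as the default encoding; to_text is a hypothetical name, not something from the standard library. Because bytes is an alias for str in Python 2.6 and later, the same check should work on both Python 2 and Python 3.

```python
def to_text(value, encoding='utf8'):
    """Return text, decoding only when the input is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (unicode in Python 2, str in Python 3): hand it back as is.
    return value
```

Calling to_text(b'\xc3\xa1') and to_text(u'\xe1') both give the same one-character text, which is the "just return self" behavior wished for above.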

Or even better, let's just not have a decode method on unicode objects, which I think is the case in Python 3, and which I know we will never get in Python 2.
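
A quick way to check the Python 3 side of that, as a sketch to run on a Python 3 interpreter (the variable names are only for illustration):

```python
text = '\xe1'          # a str: Python 3's text type
data = b'\xc3\xa1'     # a bytes object

text_has_decode = hasattr(text, 'decode')   # False: text cannot be decoded
text_has_encode = hasattr(text, 'encode')   # True
data_has_decode = hasattr(data, 'decode')   # True
data_has_encode = hasattr(data, 'encode')   # False: bytes cannot be encoded
```

With the methods split this way, calling decode on text fails immediately with an AttributeError instead of a misleading encode error.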

So, be aware of it, and good luck!

Juan B. Cabral / 2012-03-30 17:40:

I guess it doesn't return "self" because decode has to return "ANOTHER" unicode object; it should return unicode(self).

Roberto Alsina / 2012-03-30 17:42:

Unicode objects are immutable, so I don't see a practical difference between self and unicode(self)...

Juan B. Cabral / 2012-03-30 17:49:

The immediate practical difference would be that it doesn't break a future "is not" check.

Roberto Alsina / 2012-03-30 17:52:

If you're validating strings using "is", you deserve to have it fail :-)

Juan B. Cabral / 2012-04-11 23:38:

I wasn't thinking of validating strings with "is"; I was thinking of checking with "is" whether two references point to the same object (which is what "is" is for).

str does the same thing anyway... but I don't like it.

Joe / 2012-03-30 21:49:

There are no unicode objects in Python 3, only strs (which have encode) and bytes (which have decode).

Roberto Alsina / 2012-03-30 21:54:

Yes, strs in Python 3 are unicode. Thanks for the clarification.

claudio canepa / 2012-03-31 01:24:

OT: for small snippets like today's, the area at the end of the code plus the following lines of normal text doesn't look quite right;

If you wanted the blank line at the end of the .. code to stay outside the block, would there be a simple way to indicate that in reST?

Roberto Alsina / 2012-03-31 01:43:

The CSS would need a bit of a look; it's probably just a misplaced padding.

Ralsina.Me — Roberto Alsina's website