Ralsina.Me — Roberto Alsina's website

Unicode in Python is Fun!

As I hope you know, if you get a string of bytes and want the text in it, and that text may be non-ASCII, you need to decode the string using the correct encoding name:

>>> 'á'.decode('utf8')
u'\xe1'
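
To see why the "correct encoding name" part matters, here is a minimal sketch that decodes the same two bytes with two different encodings. The variable names are just for illustration, and the byte string is written with escapes so it does not depend on the terminal's encoding.

```python
data = b'\xc3\xa1'  # the UTF-8 encoding of the character á

right = data.decode('utf8')     # one character: u'\xe1' (á)
wrong = data.decode('latin-1')  # two characters: u'\xc3\xa1' (mojibake)

print(len(right))  # 1
print(len(wrong))  # 2
```

Both calls succeed, so picking the wrong encoding does not necessarily raise an error; it can quietly give you the wrong text.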

However, there is a gotcha. You have to be absolutely sure that the thing you are decoding is a string of bytes and not a unicode object, because unicode objects also have a decode method. It's an incredibly useless one, whose only purpose in life is causing this peculiar error:

>>> u'á'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Why peculiar? Because it's an Encode error, caused by calling decode. You see, on unicode objects, decode does something like this:

def decode(self, encoding):
    return self.encode('ascii').decode(encoding)
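
A small sketch of where the failure comes from: the implicit first step in the approximation above encodes the text to ASCII, and that alone fails for any non-ASCII character. The variable names here are only for illustration.

```python
text = u'\xe1'  # the character á, as text

error_name = None
try:
    # Same as the implicit first step in the approximation above.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    # The decode step is never reached; the failure happens while encoding.
    error_name = type(exc).__name__

print(error_name)  # UnicodeEncodeError
```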

The user wants a unicode object. He has a unicode object. By definition, there is no such thing as a way to utf-8-decode a unicode object. It just makes NO SENSE. It's like asking for a way to comb a fish, or climb a lake.

What it should return is self! Also, it's annoying as all hell in that the only way to avoid it is to check for type, which is totally unpythonic.
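
For what it's worth, the type check can at least be tucked away in one place. This is only a sketch: to_text is a hypothetical helper name, not something from the standard library, and it relies on bytes being an alias of str on Python 2.6 and later.

```python
def to_text(value, encoding='utf8'):
    """Return text, decoding only when given a byte string."""
    if isinstance(value, bytes):
        # On Python 2, bytes is an alias of str, so this catches byte strings.
        return value.decode(encoding)
    # Anything else, including text that is already unicode, is returned as is.
    return value


print(to_text(b'\xc3\xa1') == u'\xe1')  # True
print(to_text(u'\xe1') == u'\xe1')      # True
```

It returns the same object when given text, which is what I argue decode should have done in the first place.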

Or even better, let's just not have a decode method on unicode objects, which I think is the case in Python 3, and which I know we will never get in Python 2.
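
A quick way to check that, assuming the snippet is run on Python 3 rather than the Python 2.7 used in the examples above:

```python
text = '\xe1'               # a str (text) holding the character á
data = text.encode('utf8')  # bytes: b'\xc3\xa1'

print(hasattr(text, 'decode'))      # False: str only has encode
print(hasattr(data, 'encode'))      # False: bytes only has decode
print(data.decode('utf8') == text)  # True
```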

So, be aware of it, and good luck!

Juan B. Cabral / 2012-03-30 17:40:

I suppose it doesn't return "self" because decode has to return "ANOTHER" unicode object; it should return unicode(self).

Roberto Alsina / 2012-03-30 17:42:

Unicode objects are immutable, so I don't see a practical difference between self and unicode(self)...

Juan B. Cabral / 2012-03-30 17:49:

The immediate practical difference would be that it doesn't break some future "is not" check.

Roberto Alsina / 2012-03-30 17:52:

If you're validating strings using "is", you deserve to have it fail :-)

Juan B. Cabral / 2012-04-11 23:38:

I wasn't thinking of validating strings with "is"; I was thinking of checking two references to what might be the same object with "is" (that is, what "is" is for).

str does the same thing anyway... but I don't like it.

Joe / 2012-03-30 21:49:

There are no unicode objects in Python 3, only strs (which have encode) and bytes (which have decode).

Roberto Alsina / 2012-03-30 21:54:

Yes, strs in Python 3 are unicode. Thanks for the clarification.

claudio canepa / 2012-03-31 01:24:

OT: for small snippets like today's, the area at the end of the code plus the following lines of normal text don't look quite right.

If you wanted the blank line at the end of the .. code to stay outside the block, would there be a simple way to indicate that in reST?

Roberto Alsina / 2012-03-31 01:43:

The CSS would need a bit of a review; it's probably just a misplaced padding.