
Posts about programming (old posts, page 36)

Unicode in Python is Fun!

As I hope you know, if you get a string of bytes, and want the text in it, and that text may be non-ascii, what you need to do is decode the string using the correct encoding name:

>>> 'á'.decode('utf8')
u'\xe1'

However, there is a gotcha there. You have to be absolutely sure that the thing you are decoding is a string of bytes, and not a unicode object. Because unicode objects also have a decode method but it's an incredibly useless one, whose only purpose in life is causing this peculiar error:

>>> u'á'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
in position 0: ordinal not in range(128)

Why peculiar? Because it's an Encode error. Caused by calling decode. You see, on unicode objects, decode does something like this:

def decode(self, encoding):
    return self.encode('ascii').decode(encoding)

The user wants a unicode object. He has a unicode object. By definition, there is no such thing as a way to utf-8-decode a unicode object. It just makes NO SENSE. It's like asking for a way to comb a fish, or climb a lake.

What it should return is self! Also, it's annoying as all hell in that the only way to avoid it is to check for type, which is totally unpythonic.

Or even better, let's just not have a decode method on unicode objects at all, which I think is the case in python 3, and which I know we will never get in python 2.
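In the meantime, one way to defend yourself is a small helper (the name and shape here are mine, not a standard API) that only decodes actual byte strings, written so it also runs unchanged on Python 3:

```python
def to_unicode(s, encoding='utf-8'):
    # Byte strings get decoded; text objects pass through untouched,
    # which avoids the implicit-ASCII-encode trap described above.
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s

print(to_unicode(b'\xc3\xa1'))   # a byte string: gets decoded
print(to_unicode(u'\xe1'))       # already text: returned as-is
```

Yes, it's a type check, and yes, that's unpythonic, but it beats the mystery traceback.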

So, be aware of it, and good luck!

How it's done

I added a very minor feature to the site. Up here ^ you should be able to see a link that says "reSt". If you click on it, it will show you the "source code" for the page.

I did this for a few reasons:

  1. Because a comment seemed to suggest it ;-)

  2. Because it seems like a nice thing to do. Since I like reSt so much, I would like others to use it too, and showing how easy it is to write with it is cool.

  3. It's the "free software-y" thing to do. I am providing you the preferred way to modify my posts.

  4. It was ridiculously easy to add.

Also, if you see something missing, or something you would like to have on the site, please comment, I will try to add it.

Nikola is Near

I managed to do some minor work today on Nikola, the static website generator used to generate ... well, this static website.

  • Implemented tags (including per-tag RSS feeds)

  • Simplified templates

  • Separated code and configuration.

The last one was the trickiest. And as a teaser, here is the full configuration file used to create this site, except for the HTML bits for analytics, google custom search, and other things that would make no sense on other sites. I hope it's somewhat clear.

# -*- coding: utf-8 -*-

# post_pages contains (wildcard, destination, template) tuples.
# The wildcard is used to generate a list of reSt source files (whatever/thing.txt)
# That fragment must have an associated metadata file (whatever/thing.meta),
# and optionally translated files (example for spanish, with code "es"):
#     whatever/thing.es.txt and whatever/thing.es.meta
# From those files, a set of HTML fragment files will be generated:
# whatever/thing.html (and maybe whatever/thing.es.html)
# These files are combined with the template to produce rendered
# pages, which will be placed at
# output / TRANSLATIONS[lang] / destination / pagename.html
# where "pagename" is specified in the metadata file.

post_pages = (
    ("posts/*.txt", "weblog/posts", "post.tmpl"),
    ("stories/*.txt", "stories", "post.tmpl"),
)

# What is the default language?
DEFAULT_LANG = "en"

# What languages do you have?
# If a specific post is not translated to a language, then the version
# in the default language will be shown instead.
# The format is {"translationcode" : "path/to/translation" }
# the path will be used as a prefix for the generated pages location
TRANSLATIONS = {
    "en": "",
    "es": "tr/es",
}

# Data about this site
BLOG_TITLE = "Lateral Opinion"
BLOG_URL = "//"
BLOG_EMAIL = "[email protected]"
BLOG_DESCRIPTION = "I write free software. I have an opinion on almost "\
    "everything. I write quickly. A weblog was inevitable."

# Paths for different autogenerated bits. These are combined with the translation
# paths.

# Final locations are:
# output / TRANSLATION[lang] / TAG_PATH / index.html (list of tags)
# output / TRANSLATION[lang] / TAG_PATH / tag.html (list of posts for a tag)
# output / TRANSLATION[lang] / TAG_PATH / tag.xml (RSS feed for a tag)
TAG_PATH = "categories"
# Final location is output / TRANSLATION[lang] / INDEX_PATH / index-*.html
INDEX_PATH = "weblog"
# Final locations for the archives are:
# output / TRANSLATION[lang] / ARCHIVE_PATH / archive.html
# output / TRANSLATION[lang] / ARCHIVE_PATH / YEAR / index.html
ARCHIVE_PATH = "weblog"
# Final locations are:
# output / TRANSLATION[lang] / RSS_PATH / rss.xml
RSS_PATH = "weblog"

# A HTML fragment describing the license, for the sidebar.
LICENSE = """
    <a rel="license" href="">
    <img alt="Creative Commons License" style="border-width:0; margin-bottom:12px;"
    src=""/></a>"""

# A search form to search this site, for the sidebar. Has to be a <li>
# for the default template (base.tmpl).
SEARCH_FORM = """
    <!-- google custom search -->
    <!-- End of google custom search -->
"""

# Google analytics or whatever else you use. Added to the bottom of <body>
# in the default template (base.tmpl).
ANALYTICS = """
        <!-- Start of StatCounter Code -->
        <!-- End of StatCounter Code -->
        <!-- Start of Google Analytics -->
        <!-- End of Google Analytics -->
"""

# Put in global_context things you want available on all your templates.
# It can be anything, data, functions, modules, etc.
GLOBAL_CONTEXT = {
    'analytics': ANALYTICS,
    'blog_title': BLOG_TITLE,
    'blog_url': BLOG_URL,
    'translations': TRANSLATIONS,
    'license': LICENSE,
    'search_form': SEARCH_FORM,
    # Locale-dependent links
    'archives_link': {
        'es': '<a href="/tr/es/weblog/archive.html">Archivo</a>',
        'en': '<a href="/weblog/archive.html">Archives</a>',
    },
    'tags_link': {
        'es': '<a href="/tr/es/categories/index.html">Tags</a>',
        'en': '<a href="/categories/index.html">Tags</a>',
    },
}


Welcome To Nikola

If you see this, you may notice some changes in the site.

So, here is a short explanation:

  • I changed the software and the templates for this blog.

  • Yes, it's a work in progress.

  • The new software is called Nikola.

  • Yes, it's pretty cool.

Why change?

Are you kidding? My previous blog-generator (Son of BartleBlog) was not in good shape. The archives only covered 2000-2010, the "previous posts" links were a lottery, and the Spanish version of the site was missing whole sections.

So, what's Nikola?

Nikola is a static website generator. One thing about this site is that it is, and has always been, just HTML. Every "dynamic" thing you see in it, like comments, is a third party service. This site is just a bunch of HTML files sitting in a folder.

So, how does Nikola work?

Nikola takes a folder full of txt files written in restructured text, and generates HTML fragments.

Those fragments plus some light metadata (title, tags, desired output filename, external links to sources) and Some Mako Templates create HTML pages.

Those HTML pages use bootstrap to not look completely broken (hey, I never claimed to be a designer).

To make sure I don't do useless work, doit makes sure only the required files are recreated.
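doit actually keeps a database of file signatures, but the gist of "only recreate what's required" is the classic make-style freshness check, which could be sketched like this:

```python
import os

def needs_rebuild(source, target):
    # A target must be (re)created if it doesn't exist yet,
    # or if its source was modified after the target was written.
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)
```

With a thousand posts, skipping the ones that haven't changed is the difference between a one-second rebuild and a pointless full regeneration.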

Why not use <whatever>?

Because, for diverse reasons, I wanted to keep the exact URLs I have been using:

  • If I move a page, keeping the Disqus comments attached gets tricky

  • Some people may have bookmarked them

Also, I wanted:

  • Mako templates (because I like Mako)

  • Restructured text (Because I have over 1000 posts written in it)

  • Python (so I could hack it)

  • Easy to hack (currently Nikola is under 600 LOC, and is almost feature complete)

  • Support for a multilingual blog like this one.

And of course:

  • It sounded like a fun, short project. I had the suspicion that with a bit of glue, existing tools did 90% of the work. Looks like I was right, since I wrote it in a few days.

Are you going to maintain it?

Sure, since I am using it.

Is it useful for other people?

Probably not right now, because it makes a ton of assumptions for my site. I need to clean it up a bit before it's really nice.

Can other people use it?

Of course. It will be available somewhere soon.

Missing features?

No tags yet. Some other minor missing things.

Ubuntu One APIs by Example (part 1)

One of the nice things about working at Canonical is that we produce open source software. I, specifically, work in the team that does the desktop clients for Ubuntu One, which is a really cool job, and a really cool piece of software. However, one thing not enough people know is that we offer damn nice APIs for developers. We have to, since all our client code is open source, so we need those APIs for ourselves.

So, here is a small tutorial about using some of those APIs. I did it using Python and PyQt for several reasons:

  • Both are great tools for prototyping

  • Both have good support for the required stuff (DBus, HTTP, OAuth)

  • It's what I know and enjoy. Since I did this code on a Sunday, I am not going to use other things.

Having said that, there is nothing python-specific or Qt-specific in the code. Where I do a HTTP request using QtNetwork, you are free to use libsoup, or whatever.

So, on to the nuts and bolts. The main pieces of Ubuntu One, from an infrastructure perspective, are Ubuntu SSO Client, which handles user registration and login, and SyncDaemon, which handles file synchronization.

To interact with them, on Linux, they offer DBus interfaces. So, for example, this is a fragment of code showing a way to get the Ubuntu One credentials (this would normally be part of an object's __init__):

# Get the session bus
bus = dbus.SessionBus()

# Get the credentials proxy and interface
self.creds_proxy = bus.get_object("com.ubuntuone.Credentials",
    "/credentials")

# Connect to signals so you get a call when something
# credential-related happens
self.creds_iface = dbus.Interface(self.creds_proxy,
    "com.ubuntuone.CredentialsManagement")
self.creds_proxy.connect_to_signal('CredentialsFound',
    self.creds_found)
self.creds_proxy.connect_to_signal('CredentialsNotFound',
    self.creds_not_found)
self.creds_proxy.connect_to_signal('CredentialsError',
    self.creds_error)

# Call for credentials
self._credentials = None
self.get_credentials()

You may have noticed that get_credentials doesn't actually return the credentials. What it does is, it tells SyncDaemon to fetch the credentials, and then, when/if they are there, one of the signals will be emitted, and one of the connected methods will be called. This is nice, because it means you don't have to worry about your app blocking while SyncDaemon is doing all this.

But what's in those methods we used? Not much, really!

def get_credentials(self):
    # Do we have them already? If not, get'em
    if not self._credentials:
        self.creds_proxy.find_credentials()
    # Return what we've got, could be None
    return self._credentials

def creds_found(self, data):
    # Received credentials, save them.
    print "creds_found", data
    self._credentials = data
    # Don't worry about get_quota yet ;-)
    if not self._quota_info:
        self.get_quota()

def creds_not_found(self, data):
    # No credentials, remove old ones.
    print "creds_not_found", data
    self._credentials = None

def creds_error(self, data):
    # No credentials, remove old ones.
    print "creds_error", data
    self._credentials = None

So, basically, self._credentials will hold a set of credentials, or None. Congratulations, we are now logged into Ubuntu One, so to speak.

So, let's do something useful! How about asking for how much free space there is in the account? For that, we can't use the local APIs, we have to connect to the servers, who are, after all, the ones who decide if you are over quota or not.

Access is controlled via OAuth. So, to access the API, we need to sign our requests. Here is how it's done. It's not particularly enlightening, and I did not write it, I just use it:

def sign_uri(self, uri, parameters=None):
    # Without credentials, return unsigned URL
    if not self._credentials:
        return uri
    if isinstance(uri, unicode):
        uri = bytes(iri2uri(uri))
    print "uri:", uri
    method = "GET"
    credentials = self._credentials
    consumer = oauth.OAuthConsumer(credentials["consumer_key"],
                                   credentials["consumer_secret"])
    token = oauth.OAuthToken(credentials["token"],
                             credentials["token_secret"])
    if not parameters:
        _, _, _, _, query, _ = urlparse(uri)
        parameters = dict(cgi.parse_qsl(query))
    request = oauth.OAuthRequest.from_consumer_and_token(
        consumer, token=token, http_method=method,
        http_url=uri, parameters=parameters)
    sig_method = oauth.OAuthSignatureMethod_HMAC_SHA1()
    request.sign_request(sig_method, consumer, token)
    print "SIGNED:", repr(request.to_url())
    return request.to_url()

And how do we ask for the quota usage? By accessing the entry point with the proper authorization, we would get a JSON dictionary with total and used space. So, here's a simple way to do it:

    # This is on __init__
    self.nam = QtNetwork.QNetworkAccessManager(self,
        finished=self.reply_finished)


def get_quota(self):
    """Launch quota info request."""
    uri = self.sign_uri(QUOTA_API)
    url = QtCore.QUrl()
    url.setEncodedUrl(uri)
    self.nam.get(QtNetwork.QNetworkRequest(url))

Again, see how get_quota doesn't return the quota? What happens is that get_quota will launch a HTTP request to the Ubuntu One servers, which will, eventually, reply with the data. You don't want your app to block while you do that. So, QNetworkAccessManager will call self.reply_finished when it gets the response:

def reply_finished(self, reply):
    if unicode(reply.url().path()) == u'/api/quota/':
        # Handle quota responses
        self._quota_info = json.loads(unicode(reply.readAll()))
        print "Got quota: ", self._quota_info
        # Again, don't worry about update_menu yet ;-)
        self.update_menu()
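The body that comes back is plain JSON, so once you have it there is very little left to do. Here is a standalone sketch with a made-up response body (the total and used fields match what the prose describes; the values and the percentage math are just illustration):

```python
import json

# Hypothetical quota response: total and used space, assumed in bytes.
body = '{"total": 2147483648, "used": 1073741824}'
quota = json.loads(body)
percent = 100.0 * quota["used"] / quota["total"]
print("Using %.0f%% of %d bytes" % (percent, quota["total"]))
```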

What else would be nice to have? How about getting a call whenever the status of syncdaemon changes? For example, when sync is up to date, or when you get disconnected? Again, those are DBus signals we are connecting in our __init__:

self.status_proxy = bus.get_object(
    'com.ubuntuone.SyncDaemon', '/status')
self.status_iface = dbus.Interface(self.status_proxy,
    dbus_interface='com.ubuntuone.SyncDaemon.Status')
self.status_iface.connect_to_signal(
    'StatusChanged', self.status_changed)

# Get the status as of right now
self._last_status = self.process_status(
    self.status_proxy.current_status())

And what's status_changed?

def status_changed(self, status):
    print "New status:", status
    self._last_status = self.process_status(status)
    self.update_menu()

The process_status function is boring code to convert the info from syncdaemon's status into a human-readable thing like "Sync is up-to-date". So we store that in self._last_status and update the menu.

What menu? Well, a QSystemTrayIcon's context menu! What you have read are the main pieces you need to create something useful: a Ubuntu One tray app you can use in KDE, XFCE or openbox. Or, if you are on unity and install sni-qt, a Ubuntu One app indicator!

My Ubuntu One indicator in action.

You can find the source code for the whole example app at my u1-toys project in launchpad, and here is the full source code (missing some icon resources; just get the repo).

Coming soon(ish), more example apps, and cool things to do with our APIs!

rst2pdf 0.90 is out

Yes, after many moons, it's out. Here is the (as usual) incomplete changelog:

  • Added raw HTML support, by Dimitri Christodoulou

  • Fixed Issue 422: Having no .afm files made font lookup slow.

  • Fixed Issue 411: Sometimes the windows registry has the font's abspath.

  • Fixed Issue 430: Using --config option caused other options to be ignored (by charles at cstanhope dot com)

  • Fixed Issue 436: Add pdf_style_path to sphinx (by [email protected])

  • Fixed Issue 428: page numbers logged as errors

  • Added support for many pygments options in code-block (by Joaquin Sorianello)

  • Implemented Issue 404: plantuml support

  • Issue 399: support sphinx's template path option

  • Fixed Issue 406: calls to the wrong logging function

  • Implemented Issue 391: New --section-header-depth option.

  • Fixed Issue 390: the --config option was ignored.

  • Fixed Issue 379: Wrong style applied to paragraphs in definitions.

  • Fixed Issue 378: Multiline :address: were shown collapsed.

  • Implemented Issue 11: FrameBreak (and conditional FrameBreak)

  • The description of frames in page templates was just wrong.

  • Fixed Issue 374: in some cases, literal blocks were split inside a page, or the pagebreak came too early.

  • Fixed Issue 370: warning about sphinx.addnodes.highlightlang not being handled removed.

  • Fixed Issue 369: crash in hyphenator when specifying "en" as a language.

  • Compatibility fix to Sphinx 0.6.x (For python 2.7 docs)

This release did not focus on Sphinx bugs, so those are probably still there. Hopefully the next round is attacking those.

Scraping doesn't hurt

I am in general allergic to HTML, especially when it comes to parsing it. However, every now and then something comes up and it's fun to keep the muscles stretched.

So, consider the Ted Talks site. They have a really nice table with information about their talks, just in case you want to do something with them.

But how do you get that information? By scraping it. And what's an easy way to do it? By using Python and BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib

# Read the whole page.
data = urllib.urlopen('').read()
# Parse it
soup = BeautifulSoup(data)

# Find the table with the data
table = soup.findAll('table', attrs= {"class": "downloads notranslate"})[0]
# Get the rows, skip the first one
rows = table.findAll('tr')[1:]

items = []
# For each row, get the data
# And store it somewhere
for row in rows:
    cells = row.findAll('td')
    item = {}
    item['date'] = cells[0].text
    item['event'] = cells[1].text
    item['title'] = cells[2].text
    item['duration'] = cells[3].text
    item['links'] = [a['href'] for a in cells[4].findAll('a')]

And that's it! Surprisingly pain-free!
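If you'd rather not depend on BeautifulSoup at all, the same row-walking idea works with just the standard library's HTML parser. This is a sketch (Python 3 syntax, and a made-up two-row table instead of the real TED page) that collects the text of every <td> cell, grouped by row:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell texts
        self._row = None     # cells of the row currently being parsed
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_td = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data

html = """<table>
<tr><th>Date</th><th>Event</th></tr>
<tr><td>Jan 2011</td><td>TED2011</td></tr>
<tr><td>Feb 2011</td><td>TEDxSomewhere</td></tr>
</table>"""

scraper = TableScraper()
scraper.feed(html)
# Header rows have <th> cells instead of <td>, so they come out empty.
rows = [r for r in scraper.rows if r]
print(rows)
```

More typing than the BeautifulSoup version, but zero dependencies.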

To write, and to write what.

Some of you may know I have written about 30% of a book, called "Python No Muerde", available online (in Spanish only). That book has stagnated for a long time.

On the other hand, I wrote a very popular series of posts, called PyQt by Example, which has (you guessed it) stagnated for a long time.

The main problem with the book was that I tried to cover way too much ground. When complete, it would be a 500 page book, and that would involve writing half a dozen example apps, some of them in areas I am no expert.

The main problem with the post series is that the example is lame (a TODO app!) and expanding it is boring.

¡So, what better way to fix both things at once, than to merge them!

I will leave Python No Muerde as it is, and will do a new book, called PyQt No Muerde. It will keep the tone and language of Python No Muerde, and will even share some chapters, but will focus on developing a PyQt app or two, instead of the much more ambitious goals of Python No Muerde. It will be about 200 pages.

I have acquired permission from my superiors (my wife) to work on this project a couple of hours a day, in the early morning. So, it may move forward, or it may not. This is, as usual, an experiment, not a promise.

Garbage Collection Has Side Effects

Just a quick followup to The problem is is. Is it not? This is not mine; I got it from reddit.

This should really not surprise you:

>>> a = [1,2]
>>> b = [3,4]
>>> a is b
False
>>> a == b
False
>>> id(a) == id(b)
False

After all, a and b are completely different things. However:

>>> [1,2] is [3,4]
False
>>> [1,2] == [3,4]
False
>>> id([1,2]) == id([3,4])
True

Turns out that using literals, one of those things is not like the others.

First, the explanation, so you understand why this happens. When there are no more references to a piece of data, it gets garbage collected: the memory is freed, so it can be reused for other things.

In the first case, I am keeping references to both lists in the variables a and b. That means the lists have to exist at all times, since I can always say print a and python has to know what's in it.

In the second case, I am using literals, which means there is no reference to the lists after they are used. When python evaluates id([1,2]) == id([3,4]) it first evaluates the left side of the ==. After that is done, there is no need to keep [1,2] available, so it's deleted. Then, when evaluating the right side, it creates [3,4].

By pure chance, it will use the exact same place for it as it was using for [1,2]. So id will return the same value. This is just to remind you of a couple of things:

  1. a is b is usually (but not always) the same as id(a) == id(b)

  2. garbage collection can cause side effects you may not be expecting
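Both behaviors fit in one snippet. The first half is guaranteed by the language; the second is merely what CPython tends to do:

```python
a = [1, 2]
b = [3, 4]
# Both lists are referenced, so they must be distinct live objects
# at distinct addresses:
assert a is not b
assert id(a) != id(b)

# With literals, [1, 2] can be freed before [3, 4] is created,
# so the allocator may reuse the same address (not guaranteed!):
print(id([1, 2]) == id([3, 4]))
```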

The problem is is. Is it not?

This has been a repeated discussion in the Python Argentina mailing list. Since it has not come up in a while, why not recap it, so the next time it happens people can just link here.

Some people for some reason do this:

>>> a = 2
>>> b = 2
>>> a == b
True
>>> a is b
True

And then, when they do this, they are surprised:

>>> a = 1000
>>> b = 1000
>>> a == b
True
>>> a is b
False

They are surprised because "2 is 2" makes more intuitive sense than "1000 is not 1000". This could be attributed to an inclination towards platonism, but really, it's because they don't know what is is.

The is operator is (on CPython) simply a memory address comparison. If objects a and b are the same exact chunk of memory, then they "are" each other. Since python pre-creates a bunch of small integers, every 2 you create is really not a new 2, but the same 2 as last time.

This works because of two things:

  1. Integers are read-only objects. You can have as many variables "holding" the same 2, because they can't break it.

  2. In python, assignment is just aliasing. You are not making a copy of 2 when you do a = 2, you are just saying "a is another name for this 2 here".

This is surprising for people coming from other languages, like, say, C or C++. In those languages, a variable int a will never use the same memory space as another variable int b, because a and b are names for specific bytes of memory, and you can change the contents of those bytes. In C and C++, integers are a mutable type. This 2 is not that 2, unless you do it intentionally using pointers.

In fact, the way assignment works on Python also leads to other surprises, more interesting in real life. For example, look at this session:

>>> def f(s=""):
...     s+='x'
...     return s
...
>>> f()
'x'
>>> f()
'x'
>>> f()
'x'

That is really not surprising. Now, let's make a very small change:

>>> def f(l=[]):
...     l.append('x')
...     return l
>>> f()
['x']
>>> f()
['x', 'x']
>>> f()
['x', 'x', 'x']

And that is, for someone who has not seen it before, surprising. It happens because lists are a mutable type. The default argument is defined when the function is parsed, and every time you call f() you are using and returning the same l. Before, you were also using always the same s but since strings are immutable, it never changed, and you were returning a new string each time.

You could check that I am telling you the truth, using is, of course. And BTW, this is not a problem just for lists. It's a problem for objects of every class you create yourself, unless you bother making it immutable somehow. So let's be careful with default arguments, ok?
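The standard way out of the default-argument trap is the None-sentinel idiom: default to an immutable None, and build the fresh list inside the function body, so every call gets its own:

```python
def f(l=None):
    if l is None:
        l = []   # created on every call, not once at definition time
    l.append('x')
    return l

print(f())  # ['x']
print(f())  # ['x'] again -- no shared state between calls
```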

But the main problem about finding the original 1000 is not 1000 thing surprising is that, in truth, it's uninteresting. Integers are fungible. You don't care if they are the same integer, you only really care that they are equal.

Testing for integer identity is like worrying, after you loan me $1, about whether I return you a different or the same $1 coin. It just doesn't matter. What you want is just a $1 coin, or a 2, or a 1000.

Also, the result of 2 is 2 is implementation dependent. There is no reason, beyond an optimization, for that to be True.

Hoping this was clear, let me give you a last snippet:

>>> a = float('NaN')
>>> a is a
True
>>> a == a
False

UPDATE: lots of fun and interesting comments about this post at reddit and a small followup here