Ralsina.Me — Roberto Alsina's website

Posts about programming (old posts, page 71)

rst2pdf 0.90 is out

Yes, af­ter many moon­s, it's out. Here is the (as usu­al) in­com­plete changel­og:

  • Added raw HTML support, by Dimitri Christodoulou

  • Fixed Issue 422: Having no .afm files made font lookup slow.

  • Fixed Issue 411: Sometimes the windows registry has the font's abspath.

  • Fixed Issue 430: Using --config option caused other options to be ignored (by charles at cstanhope dot com)

  • Fixed Issue 436: Add pdf_style_path to sphinx (by tyler@datastax.com)

  • Fixed Issue 428: page numbers logged as errors

  • Added support for many pygments options in code-block (by Joaquin Sorianello)

  • Implemented Issue 404: plantuml support

  • Issue 399: support sphinx's template path option

  • Fixed Issue 406: calls to the wrong logging function

  • Implemented Issue 391: New --section-header-depth option.

  • Fixed Issue 390: the --config option was ignored.

  • Fixed Issue 379: Wrong style applied to paragraphs in definitions.

  • Fixed Issue 378: Multiline :address: were shown collapsed.

  • Implemented Issue 11: FrameBreak (and conditional FrameBreak)

  • The description of frames in page templates was just wrong.

  • Fixed Issue 374: in some cases, literal blocks were split inside a page, or the pagebreak came too early.

  • Fixed Issue 370: warning about sphinx.addnodes.highlightlang not being handled removed.

  • Fixed Issue 369: crash in hyphenator when specifying "en" as a language.

  • Compatibility fix to Sphinx 0.6.x (For python 2.7 docs)

This release did not focus on Sphinx bugs, so those are probably still there. Hopefully the next round will attack those.

Scraping doesn't hurt

I am in general allergic to HTML, especially when it comes to parsing it. However, every now and then something comes up and it's fun to keep the muscles stretched.

So, con­sid­er the Ted Talks site. They have a re­al­ly nice ta­ble with in­for­ma­tion about their talk­s, just in case you want to do some­thing with them.

But how do you get that in­for­ma­tion? By scrap­ing it. And what's an easy way to do it? By us­ing Python and Beau­ti­ful­Soup:

from BeautifulSoup import BeautifulSoup
import urllib

# Read the whole page.
data = urllib.urlopen('http://www.ted.com/talks/quick-list').read()
# Parse it
soup = BeautifulSoup(data)

# Find the table with the data
table = soup.findAll('table', attrs= {"class": "downloads notranslate"})[0]
# Get the rows, skip the first one
rows = table.findAll('tr')[1:]

items = []
# For each row, get the data
# And store it somewhere
for row in rows:
    cells = row.findAll('td')
    item = {}
    item['date'] = cells[0].text
    item['event'] = cells[1].text
    item['title'] = cells[2].text
    item['duration'] = cells[3].text
    item['links'] = [a['href'] for a in cells[4].findAll('a')]
    items.append(item)

And that's it! Sur­pris­ing­ly pain-free!
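The script above targets Python 2 and BeautifulSoup 3. As a rough sketch of the same row-to-dictionary idea using only Python 3's standard library html.parser, here is a version that runs against a made-up table snippet. SAMPLE, FIELDS, TableRows and parse_items are names invented for this example, and the markup is an assumption, not the real TED page:

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page layout is not reproduced here.
SAMPLE = """
<table class="downloads notranslate">
  <tr><th>Date</th><th>Event</th><th>Title</th><th>Duration</th><th>Links</th></tr>
  <tr><td>Some date</td><td>Some event</td><td>Talk A</td><td>18:02</td>
      <td><a href="/low.mp4">Low</a> <a href="/high.mp4">High</a></td></tr>
</table>
"""

FIELDS = ["date", "event", "title", "duration"]


class TableRows(HTMLParser):
    """Collect the text of each <td> and the hrefs inside it, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self.cell = None    # the cell being read, or None outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.cell = {"text": "", "links": []}
        elif tag == "a" and self.cell is not None:
            href = dict(attrs).get("href")
            if href:
                self.cell["links"].append(href)

    def handle_data(self, data):
        if self.cell is not None:
            self.cell["text"] += data

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            self.rows[-1].append(self.cell)
            self.cell = None


def parse_items(html):
    parser = TableRows()
    parser.feed(html)
    items = []
    for cells in parser.rows:
        if len(cells) < 5:  # the header row only has <th> cells, so it is skipped
            continue
        item = {name: cells[i]["text"].strip() for i, name in enumerate(FIELDS)}
        item["links"] = cells[4]["links"]
        items.append(item)
    return items


items = parse_items(SAMPLE)
```

It is more typing than the BeautifulSoup version, which is rather the point of using BeautifulSoup, but it needs nothing installed.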

To write, and to write what.

Some of you may know I have written about 30% of a book, called "Python No Muerde", available at http://nomuerde.netmanagers.com.ar (in Spanish only). That book has stagnated for a long time.

On the oth­er hand, I wrote a very pop­u­lar se­ries of post­s, called PyQt by Ex­am­ple, which has (y­ou guessed it) stag­nat­ed for a long time.

The main problem with the book was that I tried to cover way too much ground. When complete, it would be a 500-page book, and that would involve writing half a dozen example apps, some of them in areas where I am no expert.

The main prob­lem with the post se­ries is that the ex­am­ple is lame (a TO­DO ap­p!) and ex­pand­ing it is bor­ing.

So, what better way to fix both things at once than to merge them?

I will leave Python No Muerde as it is, and will do a new book, called PyQt No Muerde. It will keep the tone and lan­guage of Python No Muerde, and will even share some chap­ter­s, but will fo­cus on de­vel­op­ing a PyQt app or two, in­stead of the much more am­bi­tious goals of Python No Muerde. It will be about 200 pages.

I have ac­quired per­mis­sion from my su­pe­ri­ors (my wife) to work on this project a cou­ple of hours a day, in the ear­ly morn­ing. So, it may move for­ward, or it may not. This is, as usu­al, an ex­per­i­men­t, not a prom­ise.

Garbage Collection Has Side Effects

Just a quick followup to "The problem is is. Is it not?". This is not mine; I got it from reddit.

This should re­al­ly not sur­prise you:

>>> a = [1,2]
>>> b = [3,4]
>>> a is b
False
>>> a == b
False
>>> id(a) == id(b)
False

Af­ter al­l, a and b are com­plete­ly dif­fer­ent things. How­ev­er:

>>> [1,2] is [3,4]
False
>>> [1,2] == [3,4]
False
>>> id([1,2]) == id([3,4])
True

Turns out that us­ing lit­er­al­s, one of those things is not like the oth­er­s.

First, the ex­pla­na­tion so you un­der­stand why this hap­pen­s. When you don't have any more ref­er­ences to a piece of data, it will get garbage col­lect­ed, the mem­o­ry will be freed, so it can be reused for oth­er things.

In the first case, I am keeping references to both lists in the variables a and b. That means the lists have to exist at all times, since I can always say print a and python has to know what's in it.

In the second case, I am using literals, which means there is no reference to the lists after they are used. When python evaluates id([1,2]) == id([3,4]) it first evaluates the left side of the ==. After that is done, there is no need to keep [1,2] available, so it's deleted. Then, when evaluating the right side, it creates [3,4].

CPython will often reuse the exact same place for it that it was using for [1,2], so id returns the same value. That is an allocator detail, not a guarantee. This is just to remind you of a couple of things:

  1. a is b is usu­al­ly (but not al­ways) the same as id(a) == id(b)

  2. garbage col­lec­­tion can cause side ef­­fects you may not be ex­pec­t­ing
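The lifetime argument can be poked at directly. The following is a small sketch, assuming CPython's reference counting (other implementations may free objects later). Box, ids_while_alive and temporary_lifetime are names made up for this example, and Box stands in for a list only because plain lists cannot be weakly referenced:

```python
import weakref


class Box:
    """Hypothetical stand-in for a list (plain lists cannot be weakly referenced)."""

    def __init__(self, items):
        self.items = items


def ids_while_alive():
    # Both objects are held by names, so they have to coexist,
    # and objects that coexist never share an id.
    a = [1, 2]
    b = [3, 4]
    return id(a), id(b)


def temporary_lifetime():
    events = []
    tmp = Box([1, 2])
    # Record the moment the object is destroyed.
    weakref.finalize(tmp, events.append, "freed")
    before = list(events)
    del tmp  # drop the only reference
    after = list(events)  # on CPython the finalizer has already run here
    return before, after


id_a, id_b = ids_while_alive()
before, after = temporary_lifetime()
```

On CPython, before is empty and after contains "freed": the object goes away as soon as its last reference does, which is what leaves its memory available for the next list literal.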

The problem is is. Is it not?

This has been a re­peat­ed dis­cus­sion in the Python Ar­genti­na mail­ing list. Since it has not come up in a while, why not re­cap it, so the next time it hap­pens peo­ple can just link here.

Some peo­ple for some rea­son do this:

>>> a = 2
>>> b = 2
>>> a == b
True
>>> a is b
True

And then, when they do this, they are sur­prised:

>>> a = 1000
>>> b = 1000
>>> a == b
True
>>> a is b
False

They are sur­prised be­cause "2 is 2" makes more in­tu­itive sense than "1000 is not 1000". This could be at­trib­uted to an in­cli­na­tion to­wards pla­ton­is­m, but re­al­ly, it's be­cause they don't know what is is.

The is operator is (on CPython) simply a memory address comparison. If objects a and b are the same exact chunk of memory, then they "are" each other. Since Python pre-creates a bunch of small integers, every 2 you create is really not a new 2, but the same 2 as last time.

This works be­cause of two things:

  1. In­­te­gers are read­­-on­­ly ob­­jec­t­s. You can have as many var­i­ables "hold­ing" the same 2, be­­cause they can't break it.

  2. In python, as­sign­­ment is just alias­ing. You are not mak­ing a copy of 2 when you do a = 2, you are just say­ing "a is an­oth­er name for this 2 here".
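Since integers can't be changed, aliasing is easier to see with a list. This is a minimal sketch of the two points above. The helper compare and the variable names are invented for the example:

```python
def compare(x, y):
    """Return (equal, identical) for two objects."""
    return x == y, x is y


first = [1, 2]
alias = first        # assignment: a second name for the same list
copy = list(first)   # an explicit copy: equal contents, different object

alias_result = compare(first, alias)  # (True, True)
copy_result = compare(first, copy)    # (True, False)

# "is" agrees with comparing ids while both objects are alive.
same_as_id = (first is alias) == (id(first) == id(alias))

# A change made through one name is visible through the other,
# because there is only one list. The copy is unaffected.
alias.append(3)
```

After the append, first is [1, 2, 3] and copy is still [1, 2]. With an integer there is no append to call, which is why sharing the same 2 is harmless.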

This is surprising for people coming from other languages, like, say, C or C++. In those languages, a variable int a will never use the same memory space as another variable int b, because a and b are names for specific bytes of memory, and you can change the contents of those bytes. In C and C++, integers are a mutable type. This 2 is not that 2, unless you do it intentionally using pointers.

In fact, the way assignment works in Python also leads to other surprises, more interesting in real life. For example, look at this session:

>>> def f(s=""):
...     s+='x'
...     return s
...
>>> f()
'x'
>>> f()
'x'
>>> f()
'x'

That is re­al­ly not sur­pris­ing. Now, let's make a very small change:

>>> def f(l=[]):
...     l.append('x')
...     return l
...
>>> f()
['x']
>>> f()
['x', 'x']
>>> f()
['x', 'x', 'x']

And that is, for someone who has not seen it before, surprising. It happens because lists are a mutable type. The default argument is created once, when the def statement is executed, and every time you call f() you are using and returning the same l. Before, you were also always using the same s, but since strings are immutable, it never changed, and you were returning a new string each time.

You could check that I am telling you the truth, us­ing is, of course. And BTW, this is not a prob­lem just for list­s. It's a prob­lem for ob­jects of ev­ery class you cre­ate your­self, un­less you both­er mak­ing it im­mutable some­how. So let's be care­ful with de­fault ar­gu­ments, ok?
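Here is that check, plus the usual way around the problem: use None as the default and build the list inside the function. This is a sketch; f_shared and f_fresh are names made up here to tell the two versions apart:

```python
def f_shared(l=[]):
    # The default list is created once, when the def statement runs.
    l.append('x')
    return l


def f_fresh(l=None):
    # None is the sentinel; a new list is built on every call that omits l.
    if l is None:
        l = []
    l.append('x')
    return l


shared_same = f_shared() is f_shared()  # True: the same list both times
fresh_same = f_fresh() is f_fresh()     # False: two different lists
```

The None default works because the l = [] line runs on every call, not just once when the function is defined.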

But the main problem with finding the original "1000 is not 1000" thing surprising is that, in truth, it's uninteresting. Integers are fungible. You don't care if they are the same integer, you only really care that they are equal.

Testing for integer identity is like worrying, after you loan me $1, about whether I give you back the same $1 coin or a different one. It just doesn't matter. What you want is just a $1 coin, or a 2, or a 1000.

Al­so, the re­sult of 2 is 2 is im­ple­men­ta­tion de­pen­den­t. There is no rea­son, be­yond an op­ti­miza­tion, for that to be True.

Hop­ing this was clear, let me give you a last snip­pet:

>>> a = float('NaN')
>>> a is a
True
>>> a == a
False

UP­DATE: lots of fun and in­ter­est­ing com­ments about this post at red­dit and a small fol­lowup here


Contents © 2000-2023 Roberto Alsina