---
author: ''
category: ''
date: 2012/02/17 20:34
description: ''
link: ''
priority: ''
slug: BB994
tags: programming, python
title: Scraping doesn't hurt
type: text
updated: 2012/02/17 20:34
url_type: ''
---
I am, in general, allergic to HTML, especially when it comes to parsing it. However, every now and then something comes up, and it's fun to keep those muscles stretched.

So, consider the TED Talks site. They have a `really nice table <http://www.ted.com/talks/quick-list>`_ with information about their talks, just in case you want to do something with them.

But how do you get that information? By scraping it. And what's an easy way to do it? By using Python and BeautifulSoup:

.. code-block:: python

	from urllib.request import urlopen

	from bs4 import BeautifulSoup

	# Read the whole page.
	data = urlopen('http://www.ted.com/talks/quick-list').read()
	# Parse it, using the HTML parser from the standard library
	soup = BeautifulSoup(data, 'html.parser')

	# Find the table with the data
	table = soup.find_all('table', attrs={"class": "downloads notranslate"})[0]
	# Get the rows, skip the first one
	rows = table.find_all('tr')[1:]

	items = []
	# For each row, get the data
	# and store it somewhere
	for row in rows:
	    cells = row.find_all('td')
	    item = {}
	    item['date'] = cells[0].text
	    item['event'] = cells[1].text
	    item['title'] = cells[2].text
	    item['duration'] = cells[3].text
	    # Every link in the last cell
	    item['links'] = [a['href'] for a in cells[4].find_all('a')]
	    items.append(item)

And that's it! Surprisingly pain-free!
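
The loop leaves everything in ``items``, a plain list of dicts, so the "store it somewhere" part can be as simple as dumping it to JSON. Here is a minimal sketch of that, which is not part of the script above: ``save_items``, ``load_items`` and ``sample_items`` are made-up names, the sample row holds placeholder values rather than real TED data, and the code assumes Python 3.

.. code-block:: python

	import json

	# One placeholder row with the same keys the scraping loop builds.
	# The values are invented for illustration, they are not real TED data.
	sample_items = [
	    {
	        'date': 'Jan 2012',
	        'event': 'Example Event',
	        'title': 'An example talk',
	        'duration': '18:00',
	        'links': ['http://example.com/talk.mp4'],
	    },
	]

	def save_items(items, path):
	    """Write the list of talk dicts to ``path`` as JSON."""
	    with open(path, 'w', encoding='utf-8') as f:
	        json.dump(items, f, indent=2)

	def load_items(path):
	    """Read the list of talk dicts back from ``path``."""
	    with open(path, encoding='utf-8') as f:
	        return json.load(f)

Calling ``save_items(items, 'talks.json')`` after the scraping loop would write the scraped rows instead of the placeholder one (``talks.json`` is an arbitrary file name).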