---
author: ''
category: ''
date: 2012/02/17 20:34
description: ''
link: ''
priority: ''
slug: BB994
tags: programming, python
title: Scraping doesn't hurt
type: text
updated: 2012/02/17 20:34
url_type: ''
---

I am generally allergic to HTML, especially when it comes to parsing it. However, every
now and then something comes up, and it's fun to keep the muscles stretched. So, consider
the TED Talks site. They have a `really nice table <http://www.ted.com/talks/quick-list>`_
with information about their talks, just in case you want to do something with them.

But how do you get that information? By scraping it. And what's an easy way to do it?
By using Python and BeautifulSoup:

.. code-block:: python

    # Written for Python 2 and BeautifulSoup 3.
    from BeautifulSoup import BeautifulSoup
    import urllib

    # Read the whole page.
    data = urllib.urlopen('http://www.ted.com/talks/quick-list').read()
    # Parse it
    soup = BeautifulSoup(data)

    # Find the table with the data
    table = soup.findAll('table', attrs={"class": "downloads notranslate"})[0]
    # Get the rows, skip the first one
    rows = table.findAll('tr')[1:]

    items = []
    # For each row, get the data
    # And store it somewhere
    for row in rows:
        cells = row.findAll('td')
        item = {}
        item['date'] = cells[0].text
        item['event'] = cells[1].text
        item['title'] = cells[2].text
        item['duration'] = cells[3].text
        item['links'] = [a['href'] for a in cells[4].findAll('a')]
        items.append(item)

And that's it! Surprisingly pain-free!
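
If you'd rather try the idea without hitting the network or installing anything, here is a rough sketch of the same row-to-dictionary extraction for Python 3, swapping BeautifulSoup for the standard library's ``html.parser``. Everything illustrative in it is an assumption: the sample markup, the ``example.com`` links, and the names ``SAMPLE_HTML``, ``FIELDS``, ``TableRowParser`` and ``parse_items`` are made up for this example and are not taken from the TED site. It also skips rows that contain no ``<td>`` cells instead of always dropping the first row.

```python
from html.parser import HTMLParser

# Made-up sample markup, shaped like the table described in the post.
SAMPLE_HTML = """
<table class="downloads notranslate">
  <tr><th>Published</th><th>Event</th><th>Title</th><th>Duration</th><th>Download</th></tr>
  <tr>
    <td>Feb 2012</td><td>Example Event</td><td>An example talk</td><td>18:05</td>
    <td><a href="http://example.com/low.mp4">Low</a> <a href="http://example.com/high.mp4">High</a></td>
  </tr>
</table>
"""

# Same column order the post uses for the first four cells.
FIELDS = ['date', 'event', 'title', 'duration']


class TableRowParser(HTMLParser):
    """Collect the text and link targets of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cells per row that has <td> cells
        self._row = None   # cells of the row currently being read
        self._cell = None  # cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = {'text': [], 'links': []}
        elif tag == 'a' and self._cell is not None:
            href = dict(attrs).get('href')
            if href:
                self._cell['links'].append(href)

    def handle_data(self, data):
        # Only keep text that sits inside a <td>.
        if self._cell is not None:
            self._cell['text'].append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(self._cell)
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # a header row made of <th> cells stays empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_items(html):
    """Turn each data row into a dict, like the loop in the post."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    items = []
    for cells in parser.rows:
        item = {name: ''.join(cells[i]['text']).strip()
                for i, name in enumerate(FIELDS)}
        item['links'] = cells[4]['links']
        items.append(item)
    return items


items = parse_items(SAMPLE_HTML)
```

It is noticeably more code than the BeautifulSoup version, which is rather the point of the post: the library does the bookkeeping for you.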