
Ralsina.Me — Roberto Alsina's website


Smiljan, a Small Planet Generator

I maintain a couple of small "planet" sites. If you are not familiar with planets, they are sites that aggregate the RSS/Atom feeds of a group of people who are somehow related. The result is a nice, single, thematic feed.

Recently, when I moved them from one server to another, everything broke. Old posts showed up as new, and feeds that had not been updated in 2 years always had all their posts on top... a disaster.

I could have gone back to the old server and started debugging why rawdog was doing that, or switched to planet, or looked for other software, or used an online aggregator.

Instead, I started thinking... I had written a few RSS aggregators in the past... Feedparser is again under active development... rawdog and planet seem to be pretty much abandoned... how hard could it be to implement the minimal planet software?

Well, not all that hard, that's how hard it was. It took me 4 hours, and it was not even difficult.

One reason this was easier than what planet and rawdog do is that I am not writing a static site generator, because I already have one. So all I need this program (I called it Smiljan) to do is:

  • Parse a list of feeds and store it in a database if needed.

  • Download those feeds (respecting etag and modified-since).

  • Parse those feeds looking for entries (feedparser does that).

  • Load those entries (or rather, a tiny subset of their data) in the database.

  • Use the entries to generate a set of files to feed Nikola.

  • Use Nikola to generate and deploy the site.
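The "respecting etag and modified-since" step is an HTTP conditional GET. feedparser handles it when it is given the etag and modified arguments, as the listing further down does. The Python 3 sketch below only illustrates which request headers that amounts to, using the standard library. The helper name conditional_headers, the sample ETag and the example.com URL are made up for illustration and are not part of Smiljan.

```python
import datetime
from email.utils import format_datetime
from urllib.request import Request


def conditional_headers(etag=None, last_modified=None):
    """Build the headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if etag:
        # The server can answer 304 Not Modified if its ETag still matches.
        headers["If-None-Match"] = etag
    if last_modified is not None:
        # HTTP dates are expressed in GMT; the naive datetime is assumed to be UTC.
        stamp = last_modified.replace(tzinfo=datetime.timezone.utc)
        headers["If-Modified-Since"] = format_datetime(stamp, usegmt=True)
    return headers


headers = conditional_headers(
    etag='"abc123"',
    last_modified=datetime.datetime(2012, 11, 1, 12, 0, 0),
)
# The request is only built here, it is never sent.
request = Request("http://example.com/feed.xml", headers=headers)
print(headers)
```

A feed that has not changed then costs one tiny 304 response instead of a full download, which is what keeps frequent polling cheap.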

So, here is the final result: http://planeta.python.org.ar which still needs theming and a lot of other stuff, but works.

I implemented Smiljan as 3 doit tasks, which makes it very easy to integrate with Nikola (if you know Nikola: add "from smiljan import *" to your dodo.py and create a feeds file with the feed list in rawdog format) and voilà, running this updates the planet:

doit load_feeds update_feeds generate_posts render_site deploy
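As for the feeds file: the load_feeds task in the listing below reads it line by line, looking for a "feed" line (whose third field is the URL) followed by a "define_name" line. Here is a rough, stdlib-only Python 3 sketch of that parsing, just to show the shape of the file. The sample entries and the function name parse_feeds are invented for illustration, and real rawdog files may contain other directives that this sketch simply ignores.

```python
# Invented sample in the rawdog-style layout that load_feeds expects.
SAMPLE = """\
feed 30m http://example.com/alice/rss.xml
    define_name Alice Example
feed 1h http://example.org/bob/atom.xml
    define_name Bob
"""


def parse_feeds(text):
    """Return (url, name) pairs found in rawdog-style config text."""
    feeds = []
    url = name = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "feed" and len(parts) >= 3:
            # "feed <period> <url>": the URL is the third field.
            url = parts[2]
        elif parts[0] == "define_name":
            # The display name may contain spaces.
            name = " ".join(parts[1:])
        if url and name:
            feeds.append((url, name))
            url = name = None
    return feeds


print(parse_feeds(SAMPLE))
```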

Here is the code for smiljan.py, currently at the "gross hack that kinda works" stage. Enjoy!

# -*- coding: utf-8 -*-
import codecs
import datetime
import glob
import os
import sys

from doit.tools import timeout
import feedparser
import peewee


class Feed(peewee.Model):
    name = peewee.CharField()
    url = peewee.CharField(max_length = 200)
    last_status = peewee.CharField()
    etag = peewee.CharField(max_length = 200)
    last_modified = peewee.DateTimeField()

class Entry(peewee.Model):
    date = peewee.DateTimeField()
    feed = peewee.ForeignKeyField(Feed)
    content = peewee.TextField(max_length = 20000)
    link = peewee.CharField(max_length = 200)
    title = peewee.CharField(max_length = 200)
    guid = peewee.CharField(max_length = 200)

Feed.create_table(fail_silently=True)
Entry.create_table(fail_silently=True)

def task_load_feeds():
    feeds = []
    feed = name = None
    for line in open('feeds'):
        line = line.strip()
        if line.startswith('feed'):
            feed = line.split(' ')[2]
        if line.startswith('define_name'):
            name = ' '.join(line.split(' ')[1:])
        if feed and name:
            feeds.append([feed, name])
            feed = name = None

    def add_feed(name, url):
        f = Feed.create(
            name=name,
            url=url,
            etag='caca',
            last_modified=datetime.datetime(1970,1,1),
            )
        f.save()

    def update_feed_url(feed, url):
        feed.url = url
        feed.save()

    for feed, name in feeds:
        f = Feed.select().where(name=name)
        if not list(f):
            yield {
                'name': name,
                'actions': ((add_feed,(name, feed)),),
                'file_dep': ['feeds'],
                }
        elif list(f)[0].url != feed:
            yield {
                'name': 'updating:'+name,
                'actions': ((update_feed_url,(list(f)[0], feed)),),
                }


def task_update_feeds():
    def update_feed(feed):
        modified = feed.last_modified.timetuple()
        etag = feed.etag
        parsed = feedparser.parse(feed.url,
            etag=etag,
            modified=modified
        )
        try:
            feed.last_status = str(parsed.status)
        except AttributeError:  # No HTTP status, probably a timeout
            # TODO: log failure
            return
        if parsed.feed.get('title'):
            print parsed.feed.title
        else:
            print feed.url
        feed.etag = parsed.get('etag', 'caca')
        modified = tuple(parsed.get('date_parsed', (1970,1,1)))[:6]
        print "==========>", modified
        modified = datetime.datetime(*modified)
        feed.last_modified = modified
        feed.save()
        # No point in adding items from missing feeds
        if parsed.status >= 400:
            # TODO log failure
            return
        for entry_data in parsed.entries:
            print "========================================="
            date = entry_data.get('updated_parsed', None)
            if date is None:
                date = entry_data.get('published_parsed', None)
            if date is None:
                print "Can't parse date from:"
                print entry_data
                return False
            date = datetime.datetime(*(date[:6]))
            title = "%s: %s" %(feed.name, entry_data.get('title', 'Sin título'))
            content = entry_data.get('description',
                    entry_data.get('summary', 'Sin contenido'))
            guid = entry_data.get('guid', entry_data.link)
            link = entry_data.link
            print repr([date, title])
            entry = Entry.get_or_create(
                date = date,
                title = title,
                content = content,
                guid=guid,
                feed=feed,
                link=link,
            )
            entry.save()
    for feed in Feed.select():
        yield {
            'name': feed.name.encode('utf8'),
            'actions': [(update_feed,(feed,))],
            'uptodate': [timeout(datetime.timedelta(minutes=20))],
            }

def task_generate_posts():

    def generate_post(entry):
        meta_path = os.path.join('posts',str(entry.id)+'.meta')
        post_path = os.path.join('posts',str(entry.id)+'.txt')
        with codecs.open(meta_path, 'wb+', 'utf8') as fd:
            fd.write(u'%s\n' % entry.title.replace('\n', ' '))
            fd.write(u'%s\n' % entry.id)
            fd.write(u'%s\n' % entry.date.strftime('%Y/%m/%d %H:%M'))
            fd.write(u'\n')
            fd.write(u'%s\n' % entry.link)
        with codecs.open(post_path, 'wb+', 'utf8') as fd:
            fd.write(u'.. raw:: html\n\n')
            content = entry.content
            if not content:
                content = u'Sin contenido'
            for line in content.splitlines():
                fd.write(u'    %s\n' % line)

    for entry in Entry.select().order_by(('date', 'desc')):
        yield {
            'name': entry.id,
            'actions': [(generate_post, (entry,))],
            }
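
To make the output of generate_post concrete, here is a small Python 3 sketch that builds the same five lines the listing writes to each .meta file: title, entry id (used as the slug), date, an empty line, and the link. Reading the empty line as Nikola's tags field is an assumption, and the function name render_meta and the sample values are made up.

```python
import datetime


def render_meta(title, entry_id, date, link):
    """Return the text of a .meta file, mirroring generate_post above."""
    lines = [
        title.replace("\n", " "),         # titles must fit on one line
        str(entry_id),                    # the entry id doubles as the slug
        date.strftime("%Y/%m/%d %H:%M"),
        "",                               # left empty (assumed: tags)
        link,
    ]
    return "\n".join(lines) + "\n"


meta = render_meta(
    "Alice Example: Hello\nworld",
    42,
    datetime.datetime(2012, 11, 1, 12, 30),
    "http://example.com/alice/hello",
)
print(meta)
```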