Ralsina.Me — Roberto Alsina's website

A Simple Nikola Link Checker

One of the most important things when you are building a static site generator like Nikola is that your site should not be broken. So, I really should have done this earlier ;-)

This is a very simple link checker that ensures the pages Nikola generates have no broken links. I will make it part of Nikola proper once it's more polished and doit supports getting a list of targets.

To try it, save the script below and run it from the folder where you have your conf.py, right after you run doit.

import os
from urllib.parse import unquote, urlparse

import lxml.html


def analyze(filename):
    try:
        # Use LXML to parse the HTML (read bytes so lxml can detect the encoding)
        with open(filename, 'rb') as f:
            d = lxml.html.fromstring(f.read())
        # iterlinks() yields (element, attribute, link, pos) tuples
        for element, attribute, target, pos in d.iterlinks():
            if target == "#":  # These are always valid
                continue
            parsed = urlparse(target)
            # We only handle relative links, so skip anything with a scheme
            # or a host (protocol-relative links like //host/path).
            # TODO: check if the URL points to inside the generated
            # site and check it anyway
            if parsed.scheme or parsed.netloc:
                continue
            # Ignore the fragment, since the link will still work
            # TODO: check that the fragment is valid
            if parsed.fragment:
                target = target.split('#')[0]
            # Calculate what file or folder this points to
            target_filename = os.path.abspath(
                os.path.join(os.path.dirname(filename), unquote(target)))
            # Check if it exists, or report it
            if not os.path.exists(target_filename):
                print("In %s broken link: %s" % (filename, target))
    except Exception as exc:
        # Something bad happened, report
        print("Error with:", filename, exc)

# This is hackish: we use doit to get a list of all
# generated files. Minor modifications would let you check
# the non-generated files as well.

for task in os.popen('doit list --all', 'r').readlines():
    task = task.strip()
    if task.split(':')[0] in (
        'render_tags',
        'render_archive',
        'render_galleries',
        'render_indexes',
        'render_pages',
        'render_site') and '.html' in task:
            # It looks like a generated HTML file
            analyze(task.split(":")[-1])
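
The "check that the fragment is valid" TODO above could be approached along these lines. This is only a rough sketch, not part of the script: it is written for Python 3, it uses the standard library's html.parser instead of lxml so it runs on its own, and the names AnchorCollector and fragment_exists are made up for illustration. It assumes a fragment is valid when some element has a matching id, or an a tag has a matching name.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every value a URL fragment could point at (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # id="..." on any element, or the older <a name="..."> form
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def fragment_exists(html_text, fragment):
    """Return True if `fragment` names an anchor found in `html_text`."""
    if not fragment:  # an empty fragment just means "top of the page"
        return True
    collector = AnchorCollector()
    collector.feed(html_text)
    return fragment in collector.anchors


page = '<h1 id="intro">Intro</h1><a name="old-style"></a>'
print(fragment_exists(page, "intro"))
print(fragment_exists(page, "old-style"))
print(fragment_exists(page, "missing"))
```

In the checker, something like this would be called with the contents of the target file and parsed.fragment once the file is known to exist. Percent-encoded fragments are not handled here.
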

David Buxton / 2012-06-30 10:39:

You won't find analize in a dictionary but it means something quite different to the word you want: analyse.

Roberto Alsina / 2012-06-30 12:30:

Hahaha oops!