Tartrazine: reimplementing Pygments, or my adventure in extreme test-driven development
This is a "what I did this weekend" post, but it's slightly more interesting than others, I think. So, I reimplemented a large chunk of Pygments in a lib called Tartrazine.
Why?
Because I wanted to highlight source code in Markterm, and I wanted to do it in Crystal, not using an external dependency.
I was using Chroma, but it runs over a pipe, which makes the code look ugly, and you need to install it, and so on.
So ... I knew Chroma was a Go port of Pygments. So I thought ... how hard can it be? They already did it!
Because I believe we need more libraries, I just started writing the damned thing.
What?
Pygments/Chroma consists of three parts.
- Lexers, which take a text and turn it into a pile of tokens.
- Styles, which when asked about a token type, return a color/bold/underline/etc. "style".
- Formatters, which iterate a list of tokens, apply styles and create a stream of text (for example HTML with pretty colors).
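To make this concrete, here's a tiny hypothetical sketch of how the three parts fit together. The names and structures are made up for illustration; this is not the actual Pygments, Chroma, or Tartrazine API.

# A token is just a (token type, text) pair; a lexer produces a list of them.
tokens = [("GenericPrompt", "$"), ("Text", " "), ("Name.Builtin", "echo")]

# A style maps token types to visual attributes (just a color here).
style = {"GenericPrompt": "#888888", "Name.Builtin": "#00aa00"}

# A formatter walks the tokens, asks the style about each type,
# and emits a stream of text (here, HTML with pretty colors).
def format_html(tokens, style):
    spans = []
    for ttype, text in tokens:
        color = style.get(ttype, "#000000")
        spans.append(f'<span style="color: {color}">{text}</span>')
    return "".join(spans)

print(format_html(tokens, style))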
The hard part seemed to be the lexers, so I started there.
How?
I lied a little. I started trying to read the Pygments code. It was quickly clear that there are several kinds of lexers, but most of them (like, 90%) are "regex lexers". They are lexers that use a state machine and a bunch of regular expressions to tokenize the input.
I know and have implemented state machines. State machines are easy. So, I decided to just implement the regex lexers. They have the huge advantage that they involve little to no code. They are just a bunch of regular expressions and a bunch of rules that say "if you see this, do that".
They are implemented as data. Here's what the Ada lexer looks like:
tokens = {
    'root': [
        (r'[^\S\n]+', Text),
        (r'--.*?\n', Comment.Single),
        (r'[^\S\n]+', Text),
        (r'function|procedure|entry', Keyword.Declaration, 'subprogram'),
        (r'(subtype|type)(\s+)(\w+)',
         bygroups(Keyword.Declaration, Text, Keyword.Type), 'type_def'),
        (r'task|protected', Keyword.Declaration),
        (r'(subtype)(\s+)', bygroups(Keyword.Declaration, Text)),
        (r'(end)(\s+)', bygroups(Keyword.Reserved, Text), 'end'),
        (r'(pragma)(\s+)(\w+)', bygroups(Keyword.Reserved, Text,
                                         Comment.Preproc)),
        (r'(true|false|null)\b', Keyword.Constant),
        # builtin types
        (words(BUILTIN_LIST, suffix=r'\b'), Keyword.Type),
        (r'(and(\s+then)?|in|mod|not|or(\s+else)|rem)\b', Operator.Word),
        (r'generic|private', Keyword.Declaration),
        (r'package', Keyword.Declaration, 'package'),
        (r'array\b', Keyword.Reserved, 'array_def'),
        (r'(with|use)(\s+)', bygroups(Keyword.Namespace, Text), 'import'),
        (r'(\w+)(\s*)(:)(\s*)(constant)',
         bygroups(Name.Constant, Text, Punctuation, Text,
                  Keyword.Reserved)),
        (r'<<\w+>>', Name.Label),
While utterly inscrutable, that's just data. Then I looked at how Pygments processes that data, and it was bad news. While it's OK, it's very idiomatic Python. Like, metaclasses and things jumping around the codebase. I had a feeling it couldn't be that hard.
After all, the excellent "Write your own lexer" document explains it in about two pages of text!
So, I looked at Chroma's implementation. Let's say I am now distrustful of those who claim Go code is simple, and worried I may be extremely dumb.
Sure, if I spent some time I could understand it, but I am not a Go person, and I don't have plans to become one soon, so I had to make decisions.
And then I saw a magical folder...
A folder full of XML files, which are obviously the lexer definitions.
Chroma took the Pygments lexer definitions which were data in Python files, and turned them into data in actual data files.
And actually reading those XML files along with the Pygments doc did the trick. I now know how to write a lexer.
But really, how?
Let's look at how, using the definition of a very simple lexer as an example: the "bash_session" one. A lexer, like I said, is a state machine. Each lexer has some metadata, such as its name, aliases, etc., and some instructions about how to process input.
In this case, it says input should end with a newline.
<lexer>
  <config>
    <name>Bash Session</name>
    <alias>bash-session</alias>
    <alias>console</alias>
    <alias>shell-session</alias>
    <filename>*.sh-session</filename>
    <mime_type>text/x-sh</mime_type>
    <ensure_nl>true</ensure_nl>
  </config>
Since a lexer is a state machine, it has states. The first state is always called root. Each state has rules. Because this lexer is very simple, it has only one state with two rules.
  <rules>
    <state name="root">
      <rule pattern="^((?:\[[^]]+@[^]]+\]\s?)?[#$%>])(\s*)(.*\n?)">
        <bygroups>
          <token type="GenericPrompt"/>
          <token type="Text"/>
          <using lexer="bash"/>
        </bygroups>
      </rule>
      <rule pattern="^.+\n?">
        <token type="GenericOutput"/>
      </rule>
    </state>
  </rules>
Each rule has a pattern (a regular expression) which decides if the rule applies or not.
The first rule says "if the line starts with a prompt, capture the prompt, capture the spaces after it, and then capture the rest of the line".
Then, inside the rule, we have "actions". This rule has one action, which is "bygroups". This action says "the first group we captured is a GenericPrompt, the second group is Text, and the third group we should ask the bash lexer to tokenize".
And that makes sense, since a bash session looks like this:
$ echo hello
hello
There you have "$" (the prompt), " " (text), and "echo hello" (bash code).
The second rule is simpler. It says "capture a whole line".
So, when processing that example session, it works like this:
The state is "root" (it always starts there), and we look at the beginning of the file.
The first line matches the first rule, so we capture the prompt, the spaces, and the text. We generate the first two tokens: GenericPrompt and Text. Then we ask the bash lexer to tokenize the rest of the line. It will return a list of tokens, we keep those tokens too.
Because we matched, we move the "cursor" to the end of the match, which is at the beginning of the second line now.
And we start matching again.
The state is root. The first rule doesn't match at the position we're in. The second rule does. So we capture the whole line and generate a GenericOutput token. Move the cursor to the end of the match.
Oops, no more file. There, we tokenized.
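Written out, the token stream for that little session looks roughly like this (the tokens the bash lexer emits for "echo hello" are elided, since they depend on that lexer's own rules):

[
    ("GenericPrompt", "$"),
    ("Text", " "),
    # ...whatever the bash lexer produces for "echo hello\n" goes here...
    ("GenericOutput", "hello\n"),
]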
Just that?
Well, no. Actions can also "push a state", which means changing the current state to something else. States are kept in a stack, so if you were in state "root" and pushed the state "foo", now the stack is "root, foo" and the current state is "foo".
Of course you can also "pop" a state, which means "go back to the previous state".
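In the Python-style notation from the Ada lexer above, pushing and popping look roughly like this (a made-up fragment, not taken from any real Pygments lexer):

tokens = {
    'root': [
        # A third element in the tuple pushes a state: after emitting
        # Punctuation for the quote, we switch to the 'string' state.
        (r'"', Punctuation, 'string'),
        (r'\w+', Name),
    ],
    'string': [
        # '#pop' pops the current state, going back to the previous one.
        (r'"', Punctuation, '#pop'),
        (r'[^"]+', String),
    ],
}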
There are some other bits, such as "include", which means "pretend the rules from that other state are here" so we don't have to write them multiple times in the XML, and you can pop more than one state at a time. Whatever; the basic loop (sketched after this list) is just:
1. You are in a state.
2. Check rules until one matches.
3. Use that rule's actions (you may end up in another state).
4. Collect any tokens generated.
5. Move the cursor to the end of the match.
6. Go back to 1.
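Here's a rough sketch of that loop, in Python for brevity. This is hypothetical illustration code, not Tartrazine's actual implementation; a real lexer also has to handle things like bygroups, "include", and zero-length matches.

import re

# A hypothetical, minimal regex lexer: each state is a list of
# (pattern, token type, optional new state) rules.
STATES = {
    "root": [
        (re.compile(r'"'), "Punctuation", "string"),
        (re.compile(r"\w+"), "Name", None),
        (re.compile(r"\s+"), "Text", None),
    ],
    "string": [
        (re.compile(r'"'), "Punctuation", "#pop"),
        (re.compile(r'[^"]+'), "String", None),
    ],
}

def tokenize(text):
    pos = 0
    stack = ["root"]
    tokens = []
    while pos < len(text):
        # Check the current state's rules until one matches at the cursor.
        for pattern, ttype, new_state in STATES[stack[-1]]:
            match = pattern.match(text, pos)
            if match:
                tokens.append((ttype, match.group()))
                if new_state == "#pop":
                    stack.pop()
                elif new_state:
                    stack.append(new_state)
                pos = match.end()  # move the cursor to the end of the match
                break
        else:
            # No rule matched: emit one character as an error and keep going.
            tokens.append(("Error", text[pos]))
            pos += 1
    return tokens

print(tokenize('say "hello"'))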
And that's it. That's how you write a lexer.
And then?
But suppose I wrote the lexer: how do I know if I am doing it right? I mean, I can't just run the lexer and see if it works, right?
Well, we could if we only had a whole pile of things to tokenize, and a tool that creates the tokens in a readable format!
Hey, we have those things. There is the Pygments test suite, and Chroma can output tokens as JSON!
So, let's do some extreme test-driven development! After all, I have the tests written; now I just have to pass them, right?
I wrote just enough of the lexer to spit out tokens, wrote a test rig that compared them to Chroma's output, and started making the tests pass.
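The rig itself can be tiny. Here's a hypothetical Python version, assuming Chroma's tokens for each test file have already been dumped to a JSON list of objects with "type" and "value" fields (roughly what its JSON output looks like):

import json

def matches_chroma(our_tokens, chroma_json_path):
    # our_tokens: list of (type, value) pairs from our lexer.
    # chroma_json_path: the same input as tokenized by Chroma, saved as JSON.
    with open(chroma_json_path) as f:
        expected = [(t["type"], t["value"]) for t in json.load(f)]
    return our_tokens == expected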
Hey, I can now correctly parse plain text! https://t.co/wMv5HuLYKF pic.twitter.com/rgESeyk5Ta
— Roberto H. Alsina (@ralsina) August 3, 2024
That up there is a two-day-long thread of me trying to pass tests. When I finished, over 96% of the test suite was passing, and most of the failures were arguable (I think Chroma is wrong in some cases).
So, I had written the RegexLexer. It supports 241 languages, and it's about 300 lines of simple code.
In fact, I think someone (not me) should do what I did but write this thing in C, so it can be used from any language and both Chroma and Tartrazine are rendered obsolete.