Skip to main content

Ralsina.Me — Roberto Alsina's website

Tartrazine: reimplementing pygments or my adventure in extreme test-driven-development

This is a "what I did this week­end" post, but it's slight­ly more in­ter­est­ing than oth­er­s, I think. So, I reim­ple­ment­ed a large chunk of Pyg­ments in a lib called Tar­trazine.

Why?

Be­cause I want­ed to high­light source code in Mark­term, and I want­ed to do it in Crys­tal, not us­ing an ex­ter­nal de­pen­den­cy.

I was us­ing Chro­ma but it's run­ning over a pipe and makes the code look ug­ly, and you need to in­stall it, and so on.

So ... I knew Chro­ma was a Go port of Pyg­ments. So I thought ... how hard can it be? They al­ready did it!

Be­cause I be­lieve we need more li­braries I just start­ed writ­ing the damned thing.

What?

Pyg­ments/Chro­ma con­sists of three part­s.

  • Lex­er­s, which turn a text and turn it in­to a pile of to­ken­s.
  • Styles, which when asked about a to­ken type, re­turn a col­or/bold/un­der­line/etc. "style".
  • For­mat­ter­s, which it­er­ate a list of to­ken­s, ap­ply styles and cre­ate a stream of text (for ex­am­ple HTML with pret­ty col­ors).

The hard part seemed to be the lex­er­s, so I start­ed there.

How?

I lied a lit­tle. I start­ed try­ing to read the Pyg­ments code. It was quick­ly clear that there are sev­er­al kinds of lex­er­s, but most of them (like, 90%) are "regex lex­er­s". They are lex­ers that use a state ma­chine and a bunch of reg­u­lar ex­pres­sions to to­k­enize the in­put.

I know and have im­ple­ment­ed state ma­chines.. State ma­chines are easy. So, I de­cid­ed to just im­ple­ment the regex lex­er­s. They have the huge ad­van­tage that they have lit­tle to no code. THey are just a bunch of reg­u­lar ex­pres­sions and a bunch of rules that say "if you see this, do that".

They are im­ple­ment­ed as da­ta. Here's what the ada lex­er looks like:

    tokens = {
        'root': [
            (r'[^\S\n]+', Text),
            (r'--.*?\n', Comment.Single),
            (r'[^\S\n]+', Text),
            (r'function|procedure|entry', Keyword.Declaration, 'subprogram'),
            (r'(subtype|type)(\s+)(\w+)',
             bygroups(Keyword.Declaration, Text, Keyword.Type), 'type_def'),
            (r'task|protected', Keyword.Declaration),
            (r'(subtype)(\s+)', bygroups(Keyword.Declaration, Text)),
            (r'(end)(\s+)', bygroups(Keyword.Reserved, Text), 'end'),
            (r'(pragma)(\s+)(\w+)', bygroups(Keyword.Reserved, Text,
                                             Comment.Preproc)),
            (r'(true|false|null)\b', Keyword.Constant),
            # builtin types
            (words(BUILTIN_LIST, suffix=r'\b'), Keyword.Type),
            (r'(and(\s+then)?|in|mod|not|or(\s+else)|rem)\b', Operator.Word),
            (r'generic|private', Keyword.Declaration),
            (r'package', Keyword.Declaration, 'package'),
            (r'array\b', Keyword.Reserved, 'array_def'),
            (r'(with|use)(\s+)', bygroups(Keyword.Namespace, Text), 'import'),
            (r'(\w+)(\s*)(:)(\s*)(constant)',
             bygroups(Name.Constant, Text, Punctuation, Text,
                      Keyword.Reserved)),
            (r'<<\w+>>', Name.Label),

While ut­ter­ly un­scrutable, that's just da­ta. Then I looked at how Pyg­ments pro­cess­es that data, and it was bad news. While it's ok it's very id­iomat­ic Python. Like, meta­class­es and things jump­ing around the code­base. I had a feel­ing it could­n't be that hard.

Af­ter al­l, the ex­cel­lent write your own lex­er doc­u­ment ex­plains it in about two pages of tex­t!

So, I looked at Chro­ma's im­ple­men­ta­tion. Let's say I am now dis­trust­ful of those who claim go code is sim­ple, and wor­ried I may be ex­treme­ly dum­b.

Sure, if I spent some time I could un­der­stand it, but I am not a go per­son, and I don't have plans to be one soon, so I had to make de­ci­sion­s.

And then I saw a mag­i­cal fold­er...

A fold­er full of XML which is ob­vi­ous­ly the lex­er def­i­ni­tion­s.

Chro­ma took the Pyg­ments lex­er def­i­ni­tions which were da­ta in Python files, and turned them in­to da­ta in ac­tu­al da­ta files.

And ac­tu­al­ly read­ing those XML files along the Pyg­ments doc did the trick. I now know how to write a lex­er.

But really, How?

Let's look at how, while look­ing at the def­i­ni­tion of a very sim­ple lex­er, the "bash_ses­sion" one. A lex­er, like I said, is a state ma­chine. Each lex­er has some meta­data, such as its name, alias­es, etc, and some in­struc­tions about how to process in­put.

In this case, it says in­put should end with a new­line.

<lexer>
  <config>
    <name>Bash Session</name>
    <alias>bash-session</alias>
    <alias>console</alias>
    <alias>shell-session</alias>
    <filename>*.sh-session</filename>
    <mime_type>text/x-sh</mime_type>
    <ensure_nl>true</ensure_nl>
  </config>

Since a lex­er is a state ma­chine, it has states. The first state is al­ways called root. Each state has rules. Be­cause this lex­er is very sim­ple, it has on­ly one state with two rules.

  <rules>
    <state name="root">
      <rule pattern="^((?:\[[^]]+@[^]]+\]\s?)?[#$%&gt;])(\s*)(.*\n?)">
        <bygroups>
          <token type="GenericPrompt"/>
          <token type="Text"/>
          <using lexer="bash"/>
        </bygroups>
      </rule>
      <rule pattern="^.+\n?">
        <token type="GenericOutput"/>
      </rule>
    </state>
  </rules>

Each rule has a pat­tern (a reg­u­lar ex­pres­sion) which de­cides if the rule ap­plies or not.

The first rule says "if the line starts with a promp­t, cap­ture the promp­t, cap­ture the spa­ces af­ter it, and then cap­ture the rest of the line".

Then, in­side the rule, we have "ac­tion­s". This rule has one ac­tion, which is "by­group­s". This ac­tion says "the first group we ca­pured is a Gener­icPromp­t, the sec­ond group is Tex­t, and the third group we should ask the bash lex­er to to­k­enize".

And that makes sense, since a bash ses­sion looks like this:

$ echo hello
hello

There you have "$" (the promp­t), " " (tex­t), and "e­cho hel­lo" (bash code).

The sec­ond rule is sim­pler. It says "cap­ture a whole line".

So, when pro­cess­ing that ex­am­ple ses­sion, it works like this:

The state is "root" (it al­ways starts there), and we look at the be­gin­ning of the file.

The first line match­es the first rule, so we cap­ture the promp­t, the spaces, and the tex­t. We gen­er­ate the first two to­ken­s: Gener­icPrompt and Tex­t. Then we ask the bash lex­er to to­k­enize the rest of the line. It will re­turn a list of to­ken­s, we keep those to­kens too.

Be­cause we matched, we move the "cur­sor" to the end of the match, which is at the be­gin­ning of the sec­ond line now.

And we start match­ing again.

The state is root. The first rule does­n't match at the po­si­tion we're in. The sec­ond rule does. So we cap­ture the whole line and gen­er­ate a Gener­i­cOut­put to­ken. Move the cur­sor to the end of the match.

Oop­s, no more file. There, we to­k­enized.

Just that?

Well, no. Ac­tions can al­so "push a state" which means change the cur­rent state to some­thing else. States are kept in a stack, so if you were in state "root" and pushed the state "foo" now the stack is "root, foo" and the cur­rent state is "foo".

Of course you can al­so "pop" a state, which means "go back to the pre­vi­ous state".

There are some oth­er bit­s, such as "in­clude" which means "pre­tend the rules of the oth­er lex­er are here" so we don't have to write them many times in the XM­L, or that you can pop more than one state, what­ev­er, the ba­sic is just:

  1. You are in a state
  2. Check rules un­til one match­es
  3. Use that rule's ac­tions (y­ou may end up in an­oth­er state)
  4. Col­lect any to­kens gen­er­at­ed
  5. Move the cur­sor to the end of the match
  6. Go back to 1.

And that's it. That's how you write a lex­er.

And then?

But sup­pose I wrote the lex­er, how do I know if I am do­ing it right? I mean, I can't just run the lex­er and see if it work­s, right?

Well, we could if we on­ly had a whole pile of things to to­k­enize, and a tool that cre­ates the to­kens in a read­able for­mat!

Hey, we have those things. There is the pyg­ments test suite and Chro­ma can out­put to­kens in json!

So, let's do some ex­treme test-­driven-de­vel­op­men­t! Af­ter al­l, I have the tests writ­ten, now I just have to pass them, right?

I wrote enough lex­er to spit out to­ken­s, wrote a test rig that com­pared them to chro­ma's out­put, and start­ed writ­ing a lex­er.

That up there is a two-­day-­long thread of me try­ing to pass test­s. When I fin­ished, over 96% of the test suite was pass­ing and most of the fail­ures were ar­guable (I think chro­ma is wrong in some cas­es).

So, I had writ­ten the RegexLex­er. Looks like this

That code sup­ports 241 lan­guages, and it's about 300 lines of sim­ple code.

In fac­t, I think some­one (not me) should do what I did but write this thing in C, so it can be used from any lan­guage and both chro­ma and tar­trazine are ren­dered ob­so­lete.


Contents © 2000-2024 Roberto Alsina