Ir al contenido principal

Ralsina.Me — El sitio web de Roberto Alsina

Publicaciones sobre tartrazine

A Tale of Optimization

I reim­ple­ment­ed Pyg­ments in Crys­tal. It did­n't quite go as I ex­pect­ed. I have al­ready writ­ten about how I did it but that left a large part of the sto­ry un­told. You see, I am us­ing Crys­tal, which com­piles to na­tive code. And yet my reim­ple­men­ta­tion was slow­er than Python. That's not sup­posed to hap­pen.

I de­cid­ed to fig­ure out why, and fix it. This is the sto­ry of how I made my soft­ware that looked "ok" 30x faster. Mind you, this is go­ing to make me sound much smarter and ac­cu­rate than I am. I ran in­to 200 dead ends be­fore find­ing each im­prove­men­t.

The First Problem (v0.1.0)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine ../crycco/src/crycco.cr swapoff > x.html --standalone 533.8 ± 4.4 523.7 542.9 18.80 ± 0.92
chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html 28.4 ± 1.4 25.6 32.8 1.00
pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html 103.5 ± 2.8 95.6 109.1 3.65 ± 0.20

That bench­mark (like all the rest) is done us­ing hy­per­fine and run­ning each com­mand 50 times af­ter a 10-run warmup. Not that it needs so much care, just look at those num­ber­s. Not on­ly is tar­trazine al­most 20 times slow­er than chro­ma, it's al­so 3.5 times slow­er than Pyg­ments. And Pyg­ments is writ­ten in Python!

Even with­out com­par­ing, half a sec­ond to high­light a 100-­line file is ridicu­lous.

What's going on here? To find out, let's get data. I used callgrind to profile the code, and then kcachegrind to visualize it.

$ valgrind --tool=callgrind bin/tartrazine ../crycco/src/crycco.cr swapoff

Some functions called half a billion times

As you can see, some func­tions are called half a bil­lion times and ac­count for 40% of the ex­e­cu­tion time. What are they?

A string in Crystal is always unicode. The String::char_bytesize_at function is used to convert an offset into the string from characters to bytes. That's because unicode characters can be different "widths" in bytes. So in the string "123" the "3" is the 3rd byte, but in "áéí" the "í" starts in the 5th byte.

And why is it do­ing that? Be­cause this code does a bazil­lion regex op­er­a­tions, and the un­der­ly­ing li­brary (PCRE2) deals in bytes, not char­ac­ter­s, so ev­ery­time we need to do things like "match this regex in­to this string start­ing in the 9th po­si­tion" we need to con­vert that off­set to bytes, and then when it finds a match at byte X we need to con­vert that off­set to char­ac­ters and to ex­tract the da­ta we do it two more times, and so on.

I decided to ... not do that. One nice thing of Crystal is that even though it's compiled, the whole standard library is there in /usr/lib/crystal so I could just go and read how the regular expression code was implemented and see what to do.

Ended up writing a version of Regex and Regex.match that worked on bytes, and made my code deal with bytes instead of characters. I only needed to convert into strings when I had already generated a token, rather than in the thousands of failed attempts to generate it.

The Second Problem (commit 0626c86)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone 187.8 ± 4.4 183.2 204.1 7.48 ± 0.30
chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html 25.1 ± 0.8 23.6 28.5 1.00
pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html 89.9 ± 4.7 83.6 102.1 3.58 ± 0.22

While bet­ter this still suck­s. I made it 2.5 times faster, but it's still 7.5 times slow­er than chro­ma, and 3.6x slow­er than Python???

regex library takes all the time

Back to val­grind. This time, the pro­file was ... dif­fer­en­t. The regex li­brary was tak­ing al­most all the ex­e­cu­tion time, which makes sense, since it's do­ing all the work. But why is it so slow?

It's calling valid_utf8 ALL THE TIME?

Almost all the time is spent in valid_utf8. That's a function that checks if a string is valid UTF-8. And it's called all the time. Why? Because the regex library is written in C, and it doesn't know that the strings it's working with are already valid UTF-8. So, it checks. And checks. And checks.

Solution? Let's not do that either. The PCRE2 library has a handy flag called NO_UTF_CHECK just for that. So, if you pass that flag when you are doing a regex match, it will not call valid_utf8 at all!

The Third Problem (commit 7db8fdc)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone 30.0 ± 2.2 25.5 36.6 1.15 ± 0.10
chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html 26.0 ± 1.0 24.1 29.2 1.00
pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html 96.3 ± 7.7 86.1 125.3 3.70 ± 0.33

Yay! My com­piled pro­gram is fi­nal­ly faster than an in­ter­pret­ed one! And even in the same ball­park as the oth­er com­piled one!

I won­der what hap­pens with a larg­er file!

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c 896.6 ± 71.6 709.6 1015.8 7.07 ± 0.58
chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html 126.8 ± 2.1 122.8 132.9 1.00
pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html 229.1 ± 4.5 219.5 238.9 1.81 ± 0.05

Clear­ly some­thing very bad hap­pens when my code pro­cess­es larg­er files. I won­der what it is?

That's rather low level

At first glance this is not very informative. So, most of the execution time is spent in libgc and libc functions. That's not very helpful. But, if you look a bit harder, you'll see the execution time is spent allocating memory. Hundreds of milliseconds spent allocating memory

But memory is fast!

Yes, memory is fast. But allocating it is not. And I was allocating a lot of memory. Or rather, I was allocating memory over and over. See that memcpy there?

This took me about a day to fig­ure out, but this line of code is where ev­ery­thing be­came slow.

matched, new_pos, new_tokens = rule.match(text_bytes, pos, self)
if matched
    # Move position forward, save the tokens,
    # tokenize from the new position
    pos = new_pos
    tokens += new_tokens

That looks innocent, right? It's not. The tokens array is created at the beginning of tokenization, and every time I find new tokens I just append them to it. BUT ... how does that work? Arrays are not infinite! Where does it put the new tokens?

Well, when an ar­ray grows, it al­lo­cates a new, larg­er ar­ray, copies all the el­e­ments there and now you have a larg­er ar­ray with some room to spare. When you keep do­ing that, you are copy­ing the whole ar­ray over and over. It's the cost you pay for the con­ve­nience of hav­ing an ar­ray that can grow.

Where was I call­ing this? In the for­mat­ter. The for­mat­ter needs to see the to­kens to turn them in­to HTM­L. Here's the rel­e­vant code:

lines = lexer.group_tokens_in_lines(lexer.tokenize(text))

All it does is group them in­to lines (BTW, that again does the whole "grow an ar­ray" thing) and then it just it­er­ates the lines, then the to­kens in each line, slaps some HTML around them and writes them to a string (which again, grows).

The so­lu­tion is ... not to do that.

In Crys­tal we have it­er­a­tors, so I changed the to­k­eniz­er so that rather than re­turn­ing an ar­ray of to­kens it's an it­er­a­tor which gen­er­ates them one af­ter the oth­er. So the for­mat­ter looks more like this:

tokenizer.each do |token|
    outp << "<span class=\"#{get_css_class(token[:type])}\">#{HTML.escape(token[:value])}</span>"
    if token[:value].ends_with? "\n"
    i += 1
    outp << line_label(i) if line_numbers?
    end
end

Rather than iterate an array, it iterates ... an iterator. So no tokens array in the middle. Instead of grouping them into lines, spit out a line label after every newline character (yes, this is a bit more complicated under the hood)

There were sev­er­al oth­er op­ti­miza­tions but this is the one that made the dif­fer­ence.

The Fourth Problem (commit ae03e46)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c 73.5 ± 5.9 67.7 104.8 1.00
chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html 123.1 ± 2.7 117.5 130.3 1.68 ± 0.14
pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html 222.0 ± 5.9 207.8 239.4 3.02 ± 0.26

Fi­nal­ly, tar­trazine is the fastest one. On a large file! By a good margin! But is that all there is? Is there noth­ing else to im­prove? Well, no. I can do the same trick again!

You see, the formatter is returning a String by appending to it, and then we are writing the string to a file. That's the same problem as before!. So, I changed the formatter to take an IO object and write to it directly.

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c 70.3 ± 1.8 65.6 73.8 1.00
chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html 122.2 ± 3.0 116.6 131.1 1.74 ± 0.06
pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html 219.4 ± 4.4 212.2 235.5 3.12 ± 0.10

As you can see, that still gives a small im­prove­men­t, of just 3 mil­lisec­ond­s. But that's 5% of the to­tal time. And it's a small change.

And this is where di­min­ish­ing re­turns hit. I could prob­a­bly make it faster, but even a 10% im­prove­ment would be just 7 mil­lisec­onds on a huge file. If I were GitHub then maybe this would be worth my time, but I am not and it's not.

And how does the fi­nal ver­sion com­pare with the first one?

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine-last ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone -l c 16.1 ± 1.1 13.1 21.8 1.00
bin/tartrazine-first ../crycco/src/crycco.cr swapoff 519.1 ± 5.4 508.1 533.7 32.29 ± 2.23

A speedup fac­tor of 32.3x, for code that had noth­ing ob­vi­ous­ly wrong in it. I'd say that's pret­ty good.

Tartrazine: reimplementing pygments or my adventure in extreme test-driven-development

This is a "what I did this week­end" post, but it's slight­ly more in­ter­est­ing than oth­er­s, I think. So, I reim­ple­ment­ed a large chunk of Pyg­ments in a lib called Tar­trazine.

Why?

Be­cause I want­ed to high­light source code in Mark­term, and I want­ed to do it in Crys­tal, not us­ing an ex­ter­nal de­pen­den­cy.

I was us­ing Chro­ma but it's run­ning over a pipe and makes the code look ug­ly, and you need to in­stall it, and so on.

So ... I knew Chro­ma was a Go port of Pyg­ments. So I thought ... how hard can it be? They al­ready did it!

Be­cause I be­lieve we need more li­braries I just start­ed writ­ing the damned thing.

What?

Pyg­ments/Chro­ma con­sists of three part­s.

  • Lex­er­s, which turn a text and turn it in­to a pile of to­ken­s.
  • Styles, which when asked about a to­ken type, re­turn a col­or/bold/un­der­line/etc. "style".
  • For­mat­ter­s, which it­er­ate a list of to­ken­s, ap­ply styles and cre­ate a stream of text (for ex­am­ple HTML with pret­ty col­ors).

The hard part seemed to be the lex­er­s, so I start­ed there.

How?

I lied a lit­tle. I start­ed try­ing to read the Pyg­ments code. It was quick­ly clear that there are sev­er­al kinds of lex­er­s, but most of them (like, 90%) are "regex lex­er­s". They are lex­ers that use a state ma­chine and a bunch of reg­u­lar ex­pres­sions to to­k­enize the in­put.

I know and have im­ple­ment­ed state ma­chines.. State ma­chines are easy. So, I de­cid­ed to just im­ple­ment the regex lex­er­s. They have the huge ad­van­tage that they have lit­tle to no code. THey are just a bunch of reg­u­lar ex­pres­sions and a bunch of rules that say "if you see this, do that".

They are im­ple­ment­ed as da­ta. Here's what the ada lex­er looks like:

    tokens = {
        'root': [
            (r'[^\S\n]+', Text),
            (r'--.*?\n', Comment.Single),
            (r'[^\S\n]+', Text),
            (r'function|procedure|entry', Keyword.Declaration, 'subprogram'),
            (r'(subtype|type)(\s+)(\w+)',
             bygroups(Keyword.Declaration, Text, Keyword.Type), 'type_def'),
            (r'task|protected', Keyword.Declaration),
            (r'(subtype)(\s+)', bygroups(Keyword.Declaration, Text)),
            (r'(end)(\s+)', bygroups(Keyword.Reserved, Text), 'end'),
            (r'(pragma)(\s+)(\w+)', bygroups(Keyword.Reserved, Text,
                                             Comment.Preproc)),
            (r'(true|false|null)\b', Keyword.Constant),
            # builtin types
            (words(BUILTIN_LIST, suffix=r'\b'), Keyword.Type),
            (r'(and(\s+then)?|in|mod|not|or(\s+else)|rem)\b', Operator.Word),
            (r'generic|private', Keyword.Declaration),
            (r'package', Keyword.Declaration, 'package'),
            (r'array\b', Keyword.Reserved, 'array_def'),
            (r'(with|use)(\s+)', bygroups(Keyword.Namespace, Text), 'import'),
            (r'(\w+)(\s*)(:)(\s*)(constant)',
             bygroups(Name.Constant, Text, Punctuation, Text,
                      Keyword.Reserved)),
            (r'<<\w+>>', Name.Label),

While ut­ter­ly un­scrutable, that's just da­ta. Then I looked at how Pyg­ments pro­cess­es that data, and it was bad news. While it's ok it's very id­iomat­ic Python. Like, meta­class­es and things jump­ing around the code­base. I had a feel­ing it could­n't be that hard.

Af­ter al­l, the ex­cel­lent write your own lex­er doc­u­ment ex­plains it in about two pages of tex­t!

So, I looked at Chro­ma's im­ple­men­ta­tion. Let's say I am now dis­trust­ful of those who claim go code is sim­ple, and wor­ried I may be ex­treme­ly dum­b.

Sure, if I spent some time I could un­der­stand it, but I am not a go per­son, and I don't have plans to be one soon, so I had to make de­ci­sion­s.

And then I saw a mag­i­cal fold­er...

A fold­er full of XML which is ob­vi­ous­ly the lex­er def­i­ni­tion­s.

Chro­ma took the Pyg­ments lex­er def­i­ni­tions which were da­ta in Python files, and turned them in­to da­ta in ac­tu­al da­ta files.

And ac­tu­al­ly read­ing those XML files along the Pyg­ments doc did the trick. I now know how to write a lex­er.

But really, How?

Let's look at how, while look­ing at the def­i­ni­tion of a very sim­ple lex­er, the "bash_ses­sion" one. A lex­er, like I said, is a state ma­chine. Each lex­er has some meta­data, such as its name, alias­es, etc, and some in­struc­tions about how to process in­put.

In this case, it says in­put should end with a new­line.

<lexer>
  <config>
    <name>Bash Session</name>
    <alias>bash-session</alias>
    <alias>console</alias>
    <alias>shell-session</alias>
    <filename>*.sh-session</filename>
    <mime_type>text/x-sh</mime_type>
    <ensure_nl>true</ensure_nl>
  </config>

Since a lex­er is a state ma­chine, it has states. The first state is al­ways called root. Each state has rules. Be­cause this lex­er is very sim­ple, it has on­ly one state with two rules.

  <rules>
    <state name="root">
      <rule pattern="^((?:\[[^]]+@[^]]+\]\s?)?[#$%&gt;])(\s*)(.*\n?)">
        <bygroups>
          <token type="GenericPrompt"/>
          <token type="Text"/>
          <using lexer="bash"/>
        </bygroups>
      </rule>
      <rule pattern="^.+\n?">
        <token type="GenericOutput"/>
      </rule>
    </state>
  </rules>

Each rule has a pat­tern (a reg­u­lar ex­pres­sion) which de­cides if the rule ap­plies or not.

The first rule says "if the line starts with a promp­t, cap­ture the promp­t, cap­ture the spa­ces af­ter it, and then cap­ture the rest of the line".

Then, in­side the rule, we have "ac­tion­s". This rule has one ac­tion, which is "by­group­s". This ac­tion says "the first group we ca­pured is a Gener­icPromp­t, the sec­ond group is Tex­t, and the third group we should ask the bash lex­er to to­k­enize".

And that makes sense, since a bash ses­sion looks like this:

$ echo hello
hello

There you have "$" (the promp­t), " " (tex­t), and "e­cho hel­lo" (bash code).

The sec­ond rule is sim­pler. It says "cap­ture a whole line".

So, when pro­cess­ing that ex­am­ple ses­sion, it works like this:

The state is "root" (it al­ways starts there), and we look at the be­gin­ning of the file.

The first line match­es the first rule, so we cap­ture the promp­t, the spaces, and the tex­t. We gen­er­ate the first two to­ken­s: Gener­icPrompt and Tex­t. Then we ask the bash lex­er to to­k­enize the rest of the line. It will re­turn a list of to­ken­s, we keep those to­kens too.

Be­cause we matched, we move the "cur­sor" to the end of the match, which is at the be­gin­ning of the sec­ond line now.

And we start match­ing again.

The state is root. The first rule does­n't match at the po­si­tion we're in. The sec­ond rule does. So we cap­ture the whole line and gen­er­ate a Gener­i­cOut­put to­ken. Move the cur­sor to the end of the match.

Oop­s, no more file. There, we to­k­enized.

Just that?

Well, no. Ac­tions can al­so "push a state" which means change the cur­rent state to some­thing else. States are kept in a stack, so if you were in state "root" and pushed the state "foo" now the stack is "root, foo" and the cur­rent state is "foo".

Of course you can al­so "pop" a state, which means "go back to the pre­vi­ous state".

There are some oth­er bit­s, such as "in­clude" which means "pre­tend the rules of the oth­er lex­er are here" so we don't have to write them many times in the XM­L, or that you can pop more than one state, what­ev­er, the ba­sic is just:

  1. You are in a state
  2. Check rules un­til one match­es
  3. Use that rule's ac­tions (y­ou may end up in an­oth­er state)
  4. Col­lect any to­kens gen­er­at­ed
  5. Move the cur­sor to the end of the match
  6. Go back to 1.

And that's it. That's how you write a lex­er.

And then?

But sup­pose I wrote the lex­er, how do I know if I am do­ing it right? I mean, I can't just run the lex­er and see if it work­s, right?

Well, we could if we on­ly had a whole pile of things to to­k­enize, and a tool that cre­ates the to­kens in a read­able for­mat!

Hey, we have those things. There is the pyg­ments test suite and Chro­ma can out­put to­kens in json!

So, let's do some ex­treme test-­driven-de­vel­op­men­t! Af­ter al­l, I have the tests writ­ten, now I just have to pass them, right?

I wrote enough lex­er to spit out to­ken­s, wrote a test rig that com­pared them to chro­ma's out­put, and start­ed writ­ing a lex­er.

That up there is a two-­day-­long thread of me try­ing to pass test­s. When I fin­ished, over 96% of the test suite was pass­ing and most of the fail­ures were ar­guable (I think chro­ma is wrong in some cas­es).

So, I had writ­ten the RegexLex­er. Looks like this

That code sup­ports 241 lan­guages, and it's about 300 lines of sim­ple code.

In fac­t, I think some­one (not me) should do what I did but write this thing in C, so it can be used from any lan­guage and both chro­ma and tar­trazine are ren­dered ob­so­lete.


Contents © 2000-2024 Roberto Alsina