Skip to main content

Ralsina.Me — Roberto Alsina's website

Posts about parser

Writing a simple parser using a finite state machine

I wrote a lit­er­ate pro­gram­ming tool called Cryc­co and at its core is a very sim­ple pars­er which ex­tracts com­ment blocks from source code and or­ga­nizes them in­to a doc­u­men­t.

If you want to see an ex­am­ple, Cryc­co's web­site is made out of its own code by pro­cess­ing it with it­self.

The for­mat is very sim­ple: you write a com­ment block, then some code, then an­oth­er com­ment block, and so on. The com­ment blocks are tak­en as mark­down, and the code blocks are tak­en as source code.

The doc­u­ment is then split in sec­tion­s: 1 chunk of doc, 1 chunk of code, and pro­cessed to make it look nice, with syn­tax high­light­ing and so on.

The pars­er was just a cou­ple dozen lines of code, but when I want­ed to add more fea­tures I ran in­to a wal­l, so I rewrote it us­ing a fi­nite state ma­chine (F­S­M) ap­proach.

Since, again, Cryc­co is lit­er­ate code, you can see the pars­er by just look­ing at the source code of the pars­er it­self, which is in the Doc­u­ment class

Of course, that may not be enough, so let's go in­to de­tail­s.

The state ma­chine has a few states:

    enum State
      CommentBlock
      EnclosingCommentBlock
      CodeBlock
    end

A com­ment block is some­thing like this:

    # This is a comment
    # More comment

An en­clos­ing com­ment block is a "mul­ti­line" com­men­t, like this:

 /* This is a comment
 More comment
 */

Code blocks are lines that are not com­ments :-)

So, suppose you are in the CommentBlock state, and you see a line that starts with # you stay in the same state.

If you see a line that does not start with #, you switch to the CodeBlock state.

When you are in the CommentBlock state, the line you are in is a comment. If you are in the CodeBlock state, the line is code.

Here are the pos­si­ble tran­si­tions in this ma­chine:

    state_machine State, initial: State::CodeBlock do
      event :comment, from: [State::CodeBlock], to: State::CommentBlock
      event :enclosing_comment_start, from: [State::CodeBlock], to: State::EnclosingCommentBlock
      event :enclosing_comment_end, from: [State::EnclosingCommentBlock], to: State::CodeBlock
      event :code, from: [State::CommentBlock], to: State::CodeBlock
    end

Then, to parse the doc­u­men­t, we go line by line, and call the ap­pro­pri­ate event de­pend­ing on the line we are read­ing. That event may change the state or not.

For ex­am­ple:

        if is_comment.match(line) && !NOT_COMMENT.match(line)
          self.comment {
            # These blocks only execute when transitions are successful.
            #
            # So, this block is executed when we are transitioning
            # to a comment block, which means we are starting
            # a new section
            @sections << Section.new(@language)
          }
          # Because the docs section is supposed to be markdown, we need
          # to remove the comment marker from the line.
          processed_line = processed_line.sub(@language.match, "") unless @literate

And that's it! We send the prop­er events to the ma­chine, the ma­chine changes state, we han­dle the line ac­cord­ing to what state we are in, and we end up with a nice­ly parsed doc­u­men­t.

Parsers are some­what scary, but they don't have to be. A fi­nite state ma­chine is a very sim­ple way to write a pars­er that is easy to un­der­stand and main­tain, and of­ten is enough.

Have fun pars­ing!


Contents © 2000-2025 Roberto Alsina