Writing a simple parser using a finite state machine
I wrote a literate programming tool called Crycco and at its core is a very simple parser which extracts comment blocks from source code and organizes them into a document.
If you want to see an example, Crycco's website is made out of its own code by processing it with itself.
The format is very simple: you write a comment block, then some code, then another comment block, and so on. The comment blocks are taken as markdown, and the code blocks are taken as source code.
The document is then split in sections: 1 chunk of doc, 1 chunk of code, and processed to make it look nice, with syntax highlighting and so on.
The parser was just a couple dozen lines of code, but when I wanted to add more features I ran into a wall, so I rewrote it using a finite state machine (FSM) approach.
Since, again, Crycco is literate code, you can see the parser by just looking at the source code of the parser itself, which is in the Document
class
Of course, that may not be enough, so let's go into details.
The state machine has a few states:
enum State
CommentBlock
EnclosingCommentBlock
CodeBlock
end
A comment block is something like this:
# This is a comment
# More comment
An enclosing comment block is a "multiline" comment, like this:
/* This is a comment
More comment
*/
Code blocks are lines that are not comments :-)
So, suppose you are in the CommentBlock
state, and you see a line that starts with #
you stay in the same state.
If you see a line that does not start with #
, you switch to the CodeBlock
state.
When you are in the CommentBlock
state, the line you are in is a comment. If you are in the CodeBlock
state, the line is code.
Here are the possible transitions in this machine:
state_machine State, initial: State::CodeBlock do
event :comment, from: [State::CodeBlock], to: State::CommentBlock
event :enclosing_comment_start, from: [State::CodeBlock], to: State::EnclosingCommentBlock
event :enclosing_comment_end, from: [State::EnclosingCommentBlock], to: State::CodeBlock
event :code, from: [State::CommentBlock], to: State::CodeBlock
end
Then, to parse the document, we go line by line, and call the appropriate event depending on the line we are reading. That event may change the state or not.
For example:
if is_comment.match(line) && !NOT_COMMENT.match(line)
self.comment {
# These blocks only execute when transitions are successful.
#
# So, this block is executed when we are transitioning
# to a comment block, which means we are starting
# a new section
@sections << Section.new(@language)
}
# Because the docs section is supposed to be markdown, we need
# to remove the comment marker from the line.
processed_line = processed_line.sub(@language.match, "") unless @literate
And that's it! We send the proper events to the machine, the machine changes state, we handle the line according to what state we are in, and we end up with a nicely parsed document.
Parsers are somewhat scary, but they don't have to be. A finite state machine is a very simple way to write a parser that is easy to understand and maintain, and often is enough.
Have fun parsing!