Skip to main content

Ralsina.Me — Roberto Alsina's website

Creating Languages For Dummies

Intro

I don't have the usu­al pro­gram­mer's ed­u­ca­tion. I stud­ied math­s, and then dropped out of that, and am most­ly self­-­taugh­t. So, there are some parts of pro­gram­ming I al­ways saw weari­ly, think­ing to my­self that I re­al­ly should go to school to learn them. One re­mark­able such area is pars­ing and im­ple­ment­ing lan­guages.

Well... sure, school is al­ways a good idea, but this is not that hard. In this ar­ti­cle I will ex­plain how to go from noth­ing to a func­tion­ing, ex­ten­si­ble lan­guage, us­ing Python and Py­Pars­ing. If you are as scared of gram­mars, parsers and all that jazz as I used to be, come along, it's pret­ty sim­ple,

[TOC]

Grammar (part I)

What is a gram­mar? It's what de­fines what is and what is not valid in our lan­guage, and it's the hard­est part to get right, most­ly be­cause it's all about mak­ing de­ci­sion­s.

First de­ci­sion: lan­guage name. I will call this lan­guage Least Op­ti­mized Lan­guage Ev­er, or LOLE

I will de­scribe the lan­guage in­for­mal­ly, then show you how to de­scribe that gram­mar for­mal­ly us­ing Py­Pars­ing.

Py­Pars­ing is "con­struc­tive", you start cre­at­ing small block­s, then com­bine those blocks in­to larg­er things.

Identifiers

Some things in LOLE will have names. For ex­am­ple, vari­ables, or func­tion­s. So, we need to de­fine what a name is. Most lan­guages al­low you to use iden­ti­fiers that start with a let­ter, and then use let­ters or num­ber­s. Let's go even sim­pler: iden­ti­fiers are word­s. Just let­ter­s.

identifier = Word(alphas)

Word is a PyParsing object that means "these characters". alphas is a constant that means "letters". So: "some letters".

You can ac­tu­al­ly try this in an in­ter­ac­tive in­ter­preter:

>>> from pyparsing import *
>>> identifier = Word(alphas)
>>> identifier.parseString("foo")
(['foo'], {})

identifier.parseString takes a string, and tries to parse it as an identifier. So, "foo" is an identifier, and we get the match. Nice!

What hap­pens if we try some­thing that does­n't match?

>>> identifier.parseString("42")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected W:(ABCD...) (at char 0), (line:1, col:1)

Sur­prise! You get an er­ror. It's a pret­ty aw­ful er­ror, but it ba­si­cal­ly says "I was ex­pect­ing a thing made of let­ters (ABCD...) and I did not get it".

Be­cause er­ror mes­sages are im­por­tant let's make that bet­ter al­ready. There's al­ways time to de­vel­op bad habits lat­er.

>>> identifier = Word(alphas).setName('identifier')
>>> identifier.parseString("42")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected identifier (at char 0), (line:1, col:1)

Now it says it was ex­pect­ing an iden­ti­fier, and it did not get one. So it's more clear. Let's try one last thing:

>>> identifier.parseString("foo bar")
(['foo'], {})

You may say "hey, there are two identifiers there, and it only found one!" and you are right. That's because we are parsing identifier not identifierS. We told it to understand "foo bar" as an identifier, and it did just that. By default PyParsing will ignore anything left in the input after it found what it's looking for. If you want to make sure the whole input parses correctly, then you have to say so:

>>> identifier.parseString("foo bar", parseAll=True)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected end of text (at char 4), (line:1, col:5)

And yes, that is a valid error. We asked to know if "foo bar" is an identifier. It's not, because to be one it would have to end after "foo" because identifiers can't have spaces, and it does not.

Let's move to some­thing more chal­leng­ing....

Literals: Strings and Numbers

LOLE has lit­er­als for two type­s: strings and num­ber­s. They are the usu­al thing, ex­cept that strings are python style, ei­ther sin­gle or dou­ble quot­ed, and sup­port es­cap­ing quotes in­side them.

I could now ex­plain how to de­fine num­bers in terms of lit­er­als and com­bi­na­tions of sym­bols and num­bers and so on. And strings as a list of char­ac­ters sur­round­ed by quotes. But I am go­ing to tell you some­thing much more im­por­tan­t: learn your li­brary.

literal = quotedString ^ pyparsing_common.number

quotedString is provided by PyParsing to define "strings as usual". And number defines "numbers as usual" so we don't have to do it. Because most languages need this, so having to do it each time is dumb.

We are al­so us­ing ^ there. That means "one or the oth­er". So 'lit­er­al' will suc­cess­ful­ly parse things that are strings and al­so things that are num­ber­s.

>>> literal.parseString("-43.21E12")
([-43210000000000.0], {})
>>> literal.parseString(r"'this is a string with a \' in it!'")
(["'this is a string with a \\' in it!'"], {})

And of course, if you parse some­thing else, it will fail:

>>> literal.parseString(r"foo")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected {quotedString using single or double quotes ^ {real number with scientific notation | real number | signed integer}} (at char 0), (line:1, col:1)

Like the er­ror says, it was ex­pect­ing ei­ther a string quot­ed with sin­gle or dou­ble quotes, or one of a few kinds of num­ber­s.

So, now that we have iden­ti­fiers and val­ues, we can make this more like a re­al lan­guage and im­ple­ment ...

Assignments

This is pret­ty sim­ple:

foo = 42

So, an as­sign­ment is (for now):

  • An iden­ti­fi­er
  • An equal sign
  • A lit­er­al val­ue

And this is how it looks in Py­Pars­ing:

>>> assignment = identifier + Literal('=') + literal
>>> assignment.parseString('foo = 42')
(['foo', '=', 42], {})

Program

So far, we on­ly sup­port pro­grams that do one thing and then end. That is not very use­ful. So let's de­fine what a pro­gram is: "A list of state­ments sep­a­rat­ed with semi­colon­s". So far, the on­ly kind of state­ment we have is as­sign­men­t, so it will look like this:

program = delimitedList(assignment, delim=';')

As usu­al, Py­Pars­ing does the heavy lift­ing for us: see what de­lim­it­edList does.

And now we can parse slight­ly more in­ter­est­ing things:

>>> program.parseString('foo = 42; bar="bat"')
(['foo', '=', 42, 'bar', '=', '"bat"'], {})

Let's stop defin­ing gram­mar now for a lit­tle bit (we will come back to it) and think about how to use this thing we are cre­at­ing.

Execution (Part I)

So we now have the world's most use­less lan­guage, that on­ly sup­ports syn­tax for as­sign­ing val­ues to vari­ables. How do we make it do that?

Well, we need to take the output of parseString and process it, doing what our "program" says we should do. That's an interpreter.

So, how do we "run" ['foo', '=', 42, 'bar', '=', '"bat"']? For starters, I don't really like it. There are two statements there but we have six elements and no clear boundary between one statement and another. To fix that, we can change assignment:

>>> assignment = Group(identifier + Literal('=') + literal)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString('foo = 42; bar="bat"')
[['foo', '=', 42], ['bar', '=', '"bat"']]

Un­sur­pris­ing­ly, what Group does is group things, so we now have each state­ment as its own sep­a­rate thing.

So, we could "run" this pro­gram like this:

>>> def run_program(code):
...    parsed = program.parseString(code)
...    vars = {}
...    for name, _, value in parsed:
...        vars[name] = value
...    return vars
>>> run_program('foo = 42; bar="bat"')
{'foo': 42, 'bar': '"bat"'}

And hey, that means this is an in­ter­preter al­ready. Not a very good one, tho!

Grammar (Part II)

Now, how about this LOLE program? foo = 42; bar = foo

As we have defined the language so far, that's illegal, because on the right side of the equal sign LOLE requires a literal, which is either a string or a number. What we really want on the right side is more general: an expression. An expression is something that eventually resolves to a value. So, variable names are expressions, function calls (when we have them) will be expressions, and things like 2 + 2 would be expressions if we implemented them.

So, let's in­tro­duce ex­pres­sion­s, which are ei­ther lit­er­al­s, or iden­ti­fiers (so far):

>>> expression = literal ^ identifier
>>> assignment = Group(identifier + Literal('=') + expression)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString("foo = 42; bar = foo")
[['foo', '=', 42], ['bar', '=', 'foo']]

Execution (Part II)

So, a LOLE ex­pres­sion in an as­sign­ment can cur­rent­ly be three kinds of things:

  • A number: 42
  • An identifier: 'foo'
  • A string: '"foo"'

What happens if we run this through run_program?

>>> run_program("foo = 42; bar = foo")
{'foo': 42, 'bar': 'foo'}

That is not really right, is it? We would want bar to be set to 42. The problem is that we are not calculating the value of expressions, but just passing them along. We need eval_expression:

>>> vars = {}
>>> def eval_expression(expr):
...    if isinstance(expr, str):
...        if expr[0] in '"\'':  # string literal
...            return expr[1:-1] # remove quotes
...        else: # identifier
...            return vars.get(expr)
...    # Number, return as-is
...    return expr

This solves that prob­lem of "re­solv­ing" vari­ables, and al­so that of ex­tra­ne­ous quotes in our string lit­er­al­s. And then we need to use it in run_pro­gram, too:

>>> def run_program(code):
...    parsed = program.parseString(code)
...    for name, _, value in parsed:
...        vars[name] = eval_expression(value)
...    return vars
   
>>> run_program("foo = 'quux'; bar = foo")
{'foo': 'quux', 'bar': 'quux'}

Grammar (Part III)

Let's make LOLE slight­ly more in­ter­est­ing by adding func­tion­s. Since cre­at­ing gram­mars is most­ly a mat­ter of mak­ing de­ci­sion­s, let's make a cou­ple.

  • A function call is identifier() or identifier(arg) or identifier(foo=expr, bar=expr). If a function takes more than one argument, it has to use keyword arguments.
  • Func­tions are not de­fined in LOLE, they are pro­vid­ed by Python.

While key­word ar­gu­ments look a lot like as­sign­ments, they are not the same thing, so we de­fine them sep­a­rate­ly.

>>> kwarg = Group(identifier + Literal('=') + expression)
>>> function_call = Group(identifier + Literal('(') + Optional(expression ^ delimitedList(kwarg)) + Literal(')'))
>>> print function_call.parseString('foo(bar=1, baz=2)')
[['foo', '(', ['bar', '=', 1], ['baz', '=', 2], ')']]
>>> print function_call.parseString('foo(bar=1)')
[['foo', '(', ['bar', '=', 1], ')']]
>>> print function_call.parseString('foo(42)')
[['foo', '(', 42, ')']]
>>> print function_call.parseString('foo()')
[['foo', '(', ')']]

This shows we have a num­ber of prob­lems in how we process our code.

How can we distinguish ['foo', '(', ')'] (a function call) from ['foo', '=', 4] (an assignment)? Clearly the output of parsing a function call is not distinctive enough. We can improve this by using a PyParsing feature called parseActions.

What parseActions do is process the result of parsing and allow you to use the knowledge of what you are parsing to do The Right Thing (TM).


class FunctionCall():
    pass

def function_call_action(tokens):
    f_call = FunctionCall()
    f_call.name = tokens[0][0]
    f_call.args = tokens[0][1:]
    return f_call
    
function_call = Group(identifier + Literal('(') + Optional(expression ^ delimitedList(kwarg)) + Literal(')'))
function_call = function_call.setParseAction(function_call_action)

>>> print function_call.parseString('foo()')[0].name
foo
>>> print function_call.parseString('foo()')[0].args
['(', ')']

Do we re­al­ly want args to con­tain the paren­the­sis? I say no. While they are use­ful for defin­ing the syn­tax, we re­al­ly don't need them at al­l, so we can use sup­press to make our parser's out­put clean­er:

function_call = Group(identifier + Literal('(').suppress() + Optional(expression ^ delimitedList(kwarg)) + Literal(')').suppress())
function_call = function_call.setParseAction(function_call_action)

>>> print function_call.parseString('foo()')[0].name
foo
>>> print function_call.parseString('foo()')[0].args
[]
>>> print function_call.parseString('foo(42)')[0].args
[42]
>>> print function_call.parseString('foo(a=5, b=7)')[0].args
[(['a', '=', 5], {}), (['b', '=', 7], {})]

And since call­ing func­tions pro­duces a val­ue, it's a type of ex­pres­sion. And since we can just want to run a func­tion and not care about what it re­turn­s, it's al­so a type of state­men­t. Let's put all our gram­mar in one place:

identifier = Word(alphas)
literal = quotedString ^ pyparsing_common.number
function_call = Forward()
expression = literal ^ function_call
kwarg = Group(identifier + Literal('=') + expression)
assignment = Group(identifier + Literal('=') + expression)
function_call << Group(identifier + Literal('(').suppress() + Optional(expression ^  delimitedList(kwarg)) + Literal(')').suppress()).setParseAction(function_call_action)

statement = assignment ^ function_call
program = delimitedList(statement, delim=';')

The only new things are related to function_call. Since its definition needs to know about expression, and expression is defined using function_call we have a bit of a chicken-egg problem. The solution is to use Forward, so we say "hey, there is a thing called function_call, I will explain later".

Then, when actually defining function_call it uses << so the definition is applied to the Forward placeholder.

Execution (Part III)

Since we added function calls ... how do we execute those functions? By improving our run_program of course. So far, we treated every statement as an assignment, but now it can be either an assignment or a function_call.

Also, our eval_expression knew how to evaluate literals and variables, but not function calls, so that will need some tweaking also.

Let's do eval_expression first:

# This dictionary exposes functions to the LOLE interpreter
funcs = {
    'sin': sin,
}

def eval_expression(expr):
    if isinstance(expr, str):
       if expr[0] in '"\'':  # string literal
           return expr[1:-1] # remove quotes
       else: # identifier
           return vars.get(expr)
    elif isinstance(expr, FunctionCall):
        a = []
        kw = {}
        # Process function call arguments
        for arg in expr.args:
            if isinstance(arg, list):
                # argument is a kwarg
                kw[arg[0]] = eval_expression(arg[2])
            else:
                # argument is a literal value
                a.append(eval_expression(arg))
        return funcs[expr.name](*a, **kw)
    # Number, return as-is
    return expr

Sure, it got a little more complicated, but basically, all it does is add a special case for when the expression is a FunctionCall and handle that case by calling the named function with the given arguments. Since arguments are expressions themselves, we need to recursively call eval_expression to get their value.

And now we need to do some­thing sim­i­lar with run_pro­gram:

def run_program(code):
    parsed = program.parseString(code)
    for statement in parsed:
        if isinstance(statement, FunctionCall):
            eval_expression(statement)
        else:  # assignment
            name, _, value = statement
            vars[name] = eval_expression(value)
    return vars

>>> print run_program('x = sin(0); sin(1)')
{'x': 0.0}

Again, while it­er­at­ing over the state­ments, we check if the state­ment is a func­tion cal­l, in which case we eval­u­ate it, or if it's an as­sign­men­t, in which case we eval­u­ate the right side and as­sign that val­ue to the vari­able named on the left side.

This shows how extending the language works. If we add a new type of expression in the grammar, we add a new branch to eval_expression. If we add a new type of statement in the grammar, we add a branch to run_program. That is all there is to it. To prove it, let's add a more complex structure.

Grammar (Part IV)

What kind of language doesn't support an if statement? Well this one, until now. So let's add it. It could look like this:

if expression then { statements } [else { statements }]

The brackets around else mean that part is optional, since not all ifs need an else clause. The conditionally executed parts are a list of statements surrounded by braces, C-style.

Let's ex­press this in Py­Pars­ing:

code_block = Forward().setName('code block')
if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block))

statement = if_stmt ^ assignment ^ function_call
code_block << Group(Literal('{').suppress() + delimitedList(statement, delim=';') + Literal('}').suppress()).setName('code block')

program = delimitedList(statement, delim=';')

>>> print program.parseString('if 0 then {x=2} else {x=3}')
['if', 0, 'then', [['x', '=', 2]], 'else', [['x', '=', 3]]]

Again, we have to use Forward, now for code_block since if_stmt contains code_block which uses statement but if_stmt is also a statement. Also, some calls to setName so errors are nice and tidy, and use the Keyword element we had not seen before.

Just like with function calls, we end up with just a bunch of things in a list, which means when we want to execute it it will be annoying. We can improve it in exactly the same way via setParseAction:

class IfStmt():
    pass

def if_stmt_action(tokens):
    if_stmt = IfStmt()
    if_stmt.condition = tokens[0][1]
    if_stmt.then_block = tokens[0][3]
    if len(tokens[0]) > 4:  # has an else clause
        if_stmt.else_block = tokens[0][5]
    else:
        if_stmt.else_block = None
    return if_stmt

if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block)).setParseAction(if_stmt_action)

>>> if_stmt = print program.parseString('if 0 then {x=2} else {x=3}')[0]
>>> print if_stmt.condition
0
>>> print if_stmt.then_block
['x', '=', 2]
>>> print if_stmt.else_block
['x', '=', 3]

Execution (Part IV)

And, since we added a new kind of statement, we need to handle it in run_program:

def run_program(code):
    if isinstance(code, str):
        parsed = program.parseString(code)
    else:
        parsed = code
    for statement in parsed:
        if isinstance(statement, FunctionCall):
            eval_expression(statement)
        elif isinstance(statement, IfStmt):
            if eval_expression(statement.condition):
                run_program(statement.then_block)
            elif statement.else_block is not None:
                run_program(statement.else_block)
        else:  # assignment
            name, _, value = statement
            vars[name] = eval_expression(value)
    return vars

>>> print run_program('if 0 then {x=2} else {x=3}')
{'x': 3}
>>> print run_program('if 1 then {x=2} else {x=3}') 
{'x': 2}

The new bit­s:

  • There is a branch to han­dle if state­ments. Since if state­ments con­tain code block­s, which are ba­si­cal­ly the same as a pro­gram, we end up call­ing run_pro­gram re­cur­sive­ly.

  • To avoid com­pli­cat­ing the ex­am­ple too much, run_pro­gram now needs to han­dle both un­parsed pro­grams (when we want to run a scrip­t) and parsed code, when it's called re­cur­sive­ly. That is man­aged by the if at the be­gin­ning of the func­tion.

Conclusion and Full Source Code

And there you go, we now have an ex­ten­si­ble lan­guage that can call ar­bi­trary Python func­tion­s! And here is the full list­ing:

from math import sin

from pyparsing import *
from pyparsing import pyparsing_common

class FunctionCall():
    pass

def function_call_action(tokens):
    f_call = FunctionCall()
    f_call.name = tokens[0][0]
    f_call.args = tokens[0][1:]
    return f_call

class IfStmt():
    pass

def if_stmt_action(tokens):
    if_stmt = IfStmt()
    if_stmt.condition = tokens[0][1]
    if_stmt.then_block = tokens[0][3]
    if len(tokens[0]) > 4:  # has an else clause
        if_stmt.else_block = tokens[0][5]
    else:
        if_stmt.else_block = None
    return if_stmt

identifier = Word(alphas)
literal = quotedString ^ pyparsing_common.number
function_call = Forward().setName('function call')
expression = (literal ^ function_call).setName('expression')
kwarg = Group(identifier + Literal('=') + expression)
assignment = Group(identifier + Literal('=') + expression).setName('assignment')
function_call << Group(identifier + Literal('(').suppress() + Optional(expression ^  delimitedList(kwarg)) + Literal(')').suppress()).setParseAction(function_call_action)

code_block = Forward().setName('code block')
if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block)).setParseAction(if_stmt_action)

statement = if_stmt ^ assignment ^ function_call
code_block << Group(Literal('{').suppress() + delimitedList(statement, delim=';') + Literal('}').suppress()).setName('code block')

program = delimitedList(statement, delim=';')

funcs = {
    'sin': sin,
}

vars = {}

def eval_expression(expr):
    if isinstance(expr, str):
       if expr[0] in '"\'':  # string literal
           return expr[1:-1] # remove quotes
       else: # identifier
           return vars.get(expr)
    elif isinstance(expr, FunctionCall):
        a = []
        kw = {}
        # Process function call arguments
        for arg in expr.args:
            if isinstance(arg, list):
                # argument is a kwarg
                kw[arg[0]] = eval_expression(arg[2])
            else:
                # argument is a literal value
                a.append(eval_expression(arg))
        return funcs[expr.name](*a, **kw)
    # Number, return as-is
    return expr

def run_program(code):
    if isinstance(code, str):
        parsed = program.parseString(code)
    else:
        parsed = code
    for statement in parsed:
        if isinstance(statement, FunctionCall):
            eval_expression(statement)
        elif isinstance(statement, IfStmt):
            if eval_expression(statement.condition):
                run_program(statement.then_block)
            elif statement.else_block is not None:
                run_program(statement.else_block)
        else:  # assignment
            name, _, value = statement
            vars[name] = eval_expression(value)
    return vars

Exercises for the Reader

To make sure you ac­tu­al­ly un­der­stood this, the best way is to con­tin­ue this work your­self. LOLE will stay for­ev­er un­fin­ished so you can make it do in­ter­est­ing things while you learn by do­ing. Here are a few ques­tions for you to ex­plore in no par­tic­u­lar or­der:

  • How would you make a LOLE pro­gram ex­it?
  • How would that ex­it­ing pro­gram re­turn a re­sult?
  • The FunctionCall and IfStmt classes are very sketchy. How would you improve them?
  • Could the logic in run_program to handle different types of statements be moved out, into classes like FunctionCall?
  • How would you implement comparisons so if is more useful?
  • What hap­pens if you cre­ate a vari­able called 'if'?
  • What hap­pens if you have a vari­able and a func­tion with the same name? Is that good?
  • How would you im­ple­ment list­s? How about dic­tio­nar­ies?
  • How would you im­ple­ment it­er­a­tion?
  • How would you sup­port com­ments?
Paul McGuire / 2017-04-12 13:29:

Very nice article! Good use of pyparsing features, including the new `pyparsing_common` expressions. You should post this link on reddit.

Roberto Alsina / 2017-04-12 13:33:

It's usually better if someone else posts to reddit. Could you? :-)

Ron Barak / 2017-04-13 16:24:

Erratum:
So we now have the world's most usesess language -> So we now have the world's most useless language

Roberto Alsina / 2017-04-16 18:32:

Thx, fixed, rolling out soon.

A. Jesse Jiryu Davis / 2017-04-13 21:12:

Beautifully explained, great work. I read to the end and enjoyed it even though I have no plans to write a parser. I bet that if I did this introduction would be invaluable.

Roberto Alsina / 2017-04-16 18:32:

cool :-)

what essay services provide / 2017-09-03 13:59:

A great guide for those people who are still starting in creating a language that would best fit on their standards. At least, it would be a great factor for them to even make their performance even better and would help them broaden their mind when it comes to programming.


Contents © 2000-2020 Roberto Alsina