Syntax extension tutorial

From Gallium

Jump to: navigation, search

Where does the universe come from? Why are we here? Those questions cannot be easily answered. However, "How can I make a syntax extension using Camlp4?" is answerable!

Contents

Aim

What we want to do here is to create a calculator, which is something like the all-time favorite of parsers, grammars, etc. We will do that by making a module that can understand (parse) and compute expressions, and which we will plug in Camlp4. Wait. How? Let's see!

Setting things up

A Camlp4 syntax extension comes as a bytecode module (.cmo) or a library (.cma) that Camlp4 dynamically loads. This module (or set of modules in the case of library) extends the syntax understood by Camlp4. Hence our calculator will be a syntax extension too!

A couple of things to consider

  • how to name our syntax extension?
  • what language(s) should we use? Basically, we have three choices: (see Using Camlp4 for more details)
    • we write everything using the original syntax
    • we write everything using the revised syntax
    • we write parts using the original syntax and parts using the revised one. As surprising as it may be, this is how most of Camlp4 is written.
  • what syntax should we extend?

What we gonna do

  • using Camlp4 3.09 conventions, we will call our module pa_calculator.
  • for know, let's stick to the usual (original) syntax: we will use the camlp4of compiler. You will have plenty of time to learn the revised syntax later, if you wish to.
  • since original and revised parsers already know how to parse expressions, we will start from an empty grammar.

What you gonna do

  • go and create a file named "pa_calculator.ml"
  • compile it:
    ocamlc -I +camlp4 camlp4lib.cma -pp camlp4of.opt pa_calculator.ml
    Note the "-I +camlp4", which will be necessary as soon as we start using Camlp4 modules. Also note the "-pp camlp4of.opt", which tells ocamlc to pre-process our code with camlp4of.opt (the native version of camlp4of)
  • execute it:
    camlp4 -parser pa_calculator.cmo
    This loads the pa_calculator module inside Camlp4, and (currently) does nothing more. Anyway, we now have the basis for working with our calculator! So, as they say on this crappy TV show, "Next !"

Creating the syntax extension skeleton

Our syntax extension is meant to be simple, so we use the quick and non-extensible way.

Preamble

open Camlp4.PreCast

This should be the first line of pa_calculator. This creates everything we need in Camlp4:

  • locations (module Loc)
  • lexer (module Lexer)
  • grammar (module Gram)
  • and other things we do not really care for now :-)

Creating our main rule (entry point)

let expression = Gram.Entry.mk "expression"

This creates an entry named "expression". Note that the actual entry and its name are independent, but it seems good practice to use the same string (would it really make sense anyway if you created an "open_declaration" entry whose name is "class_method"? Don't think so).

Extending the grammar

The default grammar loaded by Camlp4.PreCast is contained in the module Gram.

EXTEND Gram

This statement tells Camlp4 we start to extend the "Gram" grammar.

  GLOBAL: expression;

GLOBAL declares "expression" as being a non-local grammar entry, while other entries will be local only. Note that if you don't write a GLOBAL statement, all entries are considered global.

Expressing our main entry

  expression: [ [ INT ] ];

Here we defined the "expression" entry with only one rule, which just accepts an INT token. We will add more rules in the next section.

Ending the grammar extension

END

We are done with extending the "Gram" grammar.

Calling the parser

let _ = Gram.parse_string expression (Loc.mk "<string>") (read_line ())

This is the entry point of our module. It just parses the string read from stdin as an expression. The Loc.mk function creates a location, using the string provided as its name (so here it is "<string>"). This location will be used to report errors; for example if you replace type " a", this is what you obtain:

$ camlp4 -parser pa_calculator.cmo
   a
File "<string>", line 1, characters 3-4:
Parse error: illegal begin of expression

An error occurs because our syntax extension only accepts integers as valid expressions. So... let's extend our expression entry!

Extending the expression rule

Now that we have a working parser, we want to be able to parse something more than just integers. We'll start by parsing additions and subtractions.

Sum and difference

Adding rules

  expression:
    [ [ SELF; "+"; SELF
    | SELF; "-"; SELF
    | INT] ];

There are two new things here: first, inside the brackets, we have replaced the single rule [ [ INT ] ] by several rules separated by "|". Second, the keyword SELF that is a synonym for "the entry being defined"; this can be particularly useful when changing the entry name.

Getting a result

So far we can parse things such as "1 + 2 + 3", but that is of little use since we do not compute any result. Assuming we want only integers, we can return an integer result this way:

  expression:
    [ [ x = SELF; "+"; y = SELF -> x + y
    | x = SELF; "-"; y = SELF -> x - y
    | x = INT -> int_of_string x ] ];

Here we use the result of expression (x = SELF), and associate actions with rules (with "->"). This results in expression now returning an integer.

Displaying the result

This is now obvious: just add a call to "print_int"!

  print_int
    (Gram.parse_string expression
        (Loc.mk "<string>") (read_line ()))

Product and quotient

Same steps with "*" and "/":

  expression:
    [ [ x = SELF; "+"; y = SELF -> x + y
    | x = SELF; "-"; y = SELF -> x - y
    | x = SELF; "*"; y = SELF -> x * y
    | x = SELF; "/"; y = SELF -> x / y
    | x = INT -> int_of_string x ] ];

If you're familiar with parsers, you have already seen what is wrong here :) Otherwise, compile, execute, and try something like "1 + 2 * 3".

While the result should be 7, it is 9. As Camlp4 has no way of telling which rule to give priority to, it just takes the first one. So "1 + 2 * 3" is taken as "(1 + 2) * 3". Let's see how we can overcome that.

Managing priorities

Since times (*) and slash (/) have priority over plus (+) and minus (-), we need to explain that to Camlp4. This is done using levels.

Adding levels

  expression:
    [ [ x = SELF; "+"; y = SELF -> x + y
      | x = SELF; "-"; y = SELF -> x - y ]
    | [ x = SELF; "*"; y = SELF -> x * y
      | x = SELF; "/"; y = SELF -> x / y ]
    | [ x = INT -> int_of_string x ] ];

We have split the entry in 3 levels: the first one for additions/subtractions, the second one for multiplications/divisions, and the third one for integers. When encountering "1 + 2 * 3", the parser will first try to see it as a sum of two expressions, "1" and "2 * 3", and will output the correct result.

If you invert the first and second levels, you get the same behaviour as in the previous section.

Names

We can name levels:

  expression:
    [ "add"
      [ x = SELF; "+"; y = SELF -> x + y
      | x = SELF; "-"; y = SELF -> x - y ]
    | "mul"
      [ x = SELF; "*"; y = SELF -> x * y
      | x = SELF; "/"; y = SELF -> x / y ]
    | "simple"
      [ x = INT -> int_of_string x ] ];

Associativity

Additionally, we can specify the associativity of levels, the default being LEFTA, which is the most natural. Associativity defines how expressions such as "1 + 2 + 3" should be understood:

  • using left associativity, this is equivalent to "(1 + 2) + 3"
  • using right associativity, this is equivalent to "1 + (2 + 3)".

For such things as addition, or multiplication, this does not matter. For substraction and division they are left associatives so the default is the good one. Let's define a new operator which is right associative: the pow operator, which we define as "**":

  expression:
    [ "add" LEFTA
      [ x = SELF; "+"; y = SELF -> x + y
      | x = SELF; "-"; y = SELF -> x - y ]
    | "mul" LEFTA
      [ x = SELF; "*"; y = SELF -> x * y
      | x = SELF; "/"; y = SELF -> x / y ]
    | "pow" RIGHTA
      [ x = SELF; "**"; y = SELF -> int_of_float ((float x) ** (float y)) ]
    | "simple" NONA
      [ x = INT -> int_of_string x ]
    ];

The RIGHTA is important since if you try to compute something like 223 (28 = 256), you write "2 ** 2 ** 3", and you obtain 256 with RIGHTA and 64 with LEFTA.

And then?

By now you should be familiar with the basic concepts of syntax extension. If you want to know more about it, you may consult the Syntax extension page.

If you want to generate OCaml code, you'd probably like to check OCaml code generation tutorial.

Personal tools
Espace privé