Full parser tutorial

From Gallium

Jump to: navigation, search

You can use Camlp4 as a complete parser (not a syntax extension). This means you will not start from some variant of OCaml but rather parse a completely new language. The backend will still be OCaml code and the OCaml compiler though. This can be used to quickly build a DSL compiler on top of ocamlc.

As an example, we will make a compiler for a very simple language consisting of strings, variables and print statements:

 def name = "world"
 print "hello", name

This program will print the message "hello world".




First, we define an AST for this language:

 type 'loc expr = Var of string * 'loc | String of string * 'loc ;;
 type 'loc stmt = Def of string * 'loc expr * 'loc | Print of 'loc expr list * 'loc ;;

An expression is either a variable reference or a string; a statement is either a variable definition or a print of some expressions. We want to associate a source location with every AST node; however we want to abstract over the concrete type of location that is used. We therefore declare loc as a type parameter for the type constructors expr and stmt.


We then define how such an AST is translated into an OCaml AST. The quotations we will use for this implicitly expect an Ast module with OCaml AST types. For now, we will work with the version inside the PreCast module:

 open Camlp4.PreCast ;;

The translation is very straightforward. We translate an expression into an OCaml expression, and we translate a statement into an OCaml structure item; we also define an auxiliary function to produce a string expression that concatenates a list of string expressions, interspersed with spaces:

 let expr_of_expr = function
 | Var (v,_loc) -> <:expr< $lid:v$ >>
 | String (s,_loc) -> <:expr< $str:s$ >> ;;
 let concat_exprs _loc = function
 | [] -> failwith "concat_exprs"
 | e::es -> List.fold_left (fun e e' -> <:expr< $e$ ^ " " ^ $e'$ >>) e es ;;
 let str_item_of_stmt = function
 | Def (v,e,_loc) -> <:str_item< let $lid:v$ = $expr_of_expr e$ ;; >>
 | Print (es,_loc) ->
   let es = List.map expr_of_expr es in
   <:str_item< print_endline $concat_exprs _loc es$ ;; >> ;;

Notice that the locations stored in the AST nodes are captured using the name _loc, which is the location used implicitly for the quotations. As such, the resulting OCaml AST nodes can be associated to the source location that caused it.


Next, we also define a grammar that can be used to parse our minimal language and produce an AST. We define two entry points (expr and stmt) and we extend these enty points with a rule for each possible case.

 let expr = Gram.Entry.mk "expr" ;;
 let stmt = Gram.Entry.mk "stmt" ;;
 expr: [
   [ v = LIDENT -> Var (v,_loc)
   | s = STRING -> String (s,_loc) ] ];
 stmt: [
   [ "def"; v = LIDENT; "="; e = expr -> Def (v,e,_loc)
   | "print"; es = LIST1 expr SEP ","  -> Print (es,_loc) ] ];
 END ;;

We then have to integrate these elements. The grammar should be used to parse the input; the resulting AST should be translated to an OCaml AST and this OCaml AST should be outputted.

There are three ways of doing this integration:

Flushing the initial grammar but keeping the rule structure and names

If we keep the rule structure, we can use the camlp4 program as a driver. However, we have to override the OCaml grammar at an appropriate point to install an entirely different language. To find this point, we will do some inspection of the grammar from the toplevel of the main entry points: for an implementation (.ml file) this is Syntax.implem, for an interface (.mli file) this is Syntax.interf, and for code entered at the toplevel, this is Syntax.top_phrase. Interfaces are beyond the scope of this tutorial, so we will focus on the other two:

 $ ocaml camlp4o.cma
 # open Camlp4.PreCast ;;
 # Gram.Entry.print Format.std_formatter Syntax.implem ;;
 implem: [ LEFTA
   [ "#"; a_LIDENT; opt_expr; semi
   | EOI
   | str_item; semi; SELF ] ]
 - : unit = ()
 # Gram.Entry.print Format.std_formatter Syntax.top_phrase ;;
 top_phrase: [ LEFTA
   [ EOI
   | phrase ] ]
 - : unit = ()
 # Gram.Entry.print Format.std_formatter Syntax.phrase ;;
 phrase: [ LEFTA
   [ "#"; a_LIDENT; opt_expr; semi
   | str_item; semi ] ]
 - : unit = ()
 # Gram.Entry.print Format.std_formatter Syntax.str_item ;;
 str_item: [ "top" LEFTA
   [ "let"; "module"; a_UIDENT; module_binding0; "in"; expr
   | "let"; opt_rec; binding; "in"; expr
   | "let"; opt_rec; binding

What this basically means, is that implementation is a sequence of structure items and directives (terminated by optional semicolons), and that a toplevel phrase is one structure item or directive (or the end-of-input). While directives might still be relevant in the context of our language, structure items are OCaml-specific. We therefore choose to override the grammar at the level of str_item. This influences both toplevel and implementation code.

Without functors

This is the most straightforward. We complete our code from above as follows:

 Gram.Entry.clear Syntax.str_item ;;
 Syntax.str_item : [ [ s = stmt -> str_item_of_stmt s ] ];
 END ;;

We first flush the existing definitions of str_item and then install a new rule: we parse a statement and translate it to a structure item.

With functors (using Register.OCamlSyntaxExtension)

To do the same with functors instead of the PreCast module, we have to start our definition differently. Instead of opening that module in the above code, we will write the following things.

First we import the camlp4 module types (such as Id, Loc, Ast, Camlp4Syntax etc.), and the camlp4 token constructors (such as LIDENT, STRING, etc).

 open Camlp4.Sig ;;

Next we define an Id module for our syntax (which camlp4 still views as a syntax extension):

 module Id = struct
   let name = "MinimalLanguage" ;;
   let version = "0.1" ;;
 end ;;

We can then open our syntax extension module. It is a functor that defines a Camlp4Syntax in terms of a given Camlp4Syntax. We start by opening the given syntax:

 module Minimal (Syntax : Camlp4Syntax) = struct
   open Syntax;;

Importing Syntax makes a number of modules available, most importantly Ast, Loc and Gram. That means we can include the definition of the translation and the parser unmodified.

After those definitions, we can override the str_item entry point as before. Since we already opened the Syntax module, we do not need to include that module name in the references to str_item:

 Gram.Entry.clear str_item ;;
 str_item : [ [ s = stmt -> str_item_of_stmt s ] ];
 END ;;

Finally, we close our syntax extension module with:

   include Syntax ;;
 end ;;

Since we include the original Camlp4Syntax module at the end, our syntax extension functor defines the same syntax as the one that it took as an argument. Nonetheless, the grammar modification statements will be applied at functor application time. (We could have just included the Syntax in the beginning of the definition of Minimal, but then our definition of expr would have overridden the existing expr value and this would have clashed.)

Finally, we register this syntax extension and its Id module. This occurs as a side-effect of a functor application, so we can do it as a local module definition that is never used.

 let module M = Camlp4.Register.OCamlSyntaxExtension(Id)(Minimal) in () ;;


We can compile the syntax extension (defined either with or without functors) using camlp4of:

 ocamlc -c -I +camlp4 -pp camlp4of minimal.ml

Afterwards, it can be used with the camlp4o binary. For example, if test.ml contains the example program given at the beginning of the article, we can see the generated code:

 $ camlp4o minimal.cmo test.ml
 let name = "world"
 let _ = print_endline ("Hello" ^ (" " ^ name))

Or compile and run it:

 $ ocamlc -o test -pp "camlp4o minimal.cmo" test.ml && ./test
 Hello world

Together with camlp4o.cma, it can be used to create a toplevel for our language:

 $ ocaml camlp4o.cma minimal.cmo
 # def name = "world" ;;
 val name : string = "world"
 # print "Hello", name ;;
 Hello world
 - : unit = ()
 # 1+1 ;;
 Parse error: illegal begin of top_phrase

Notice that semantic errors detected by OCaml will be reported with appropriate source locations, for example, a program with the single line:

 print foo

will produce the error:

 $ ocamlc -c -pp "camlp4o minimal.cmo" foo.ml
 File "foo.ml", line 1, characters 6-9:
 Unbound value foo

Make a new grammar using the same lexer and token type

 open Camlp4.PreCast;;
 module Gram = MakeGram(Lexer);;

Then you have to write a main function for your parser, camlp4 cannot do it for you.

Make a new lexer and token type and then make a new grammar


Personal tools
Espace privé