Full parser tutorial
You can use Camlp4 as a complete parser (not a syntax extension). This means you will not start from some variant of OCaml but rather parse a completely new language. The backend will still be OCaml code and the OCaml compiler though. This can be used to quickly build a DSL compiler on top of ocamlc.
As an example, we will make a compiler for a very simple language consisting of strings, variables and print statements:
def name = "world" print "hello", name
This program will print the message "hello world".
First, we define an AST for this language:
type 'loc expr = Var of string * 'loc | String of string * 'loc ;; type 'loc stmt = Def of string * 'loc expr * 'loc | Print of 'loc expr list * 'loc ;;
An expression is either a variable reference or a string; a statement is either a variable definition or a print of some expressions. We want to associate a source location with every AST node; however we want to abstract over the concrete type of location that is used. We therefore declare
loc as a type parameter for the type constructors
We then define how such an AST is translated into an OCaml AST. The quotations we will use for this implicitly expect an
Ast module with OCaml AST types. For now, we will work with the version inside the
open Camlp4.PreCast ;;
The translation is very straightforward. We translate an expression into an OCaml expression, and we translate a statement into an OCaml structure item; we also define an auxiliary function to produce a string expression that concatenates a list of string expressions, interspersed with spaces:
let expr_of_expr = function | Var (v,_loc) -> <:expr< $lid:v$ >> | String (s,_loc) -> <:expr< $str:s$ >> ;;
let concat_exprs _loc = function |  -> failwith "concat_exprs" | e::es -> List.fold_left (fun e e' -> <:expr< $e$ ^ " " ^ $e'$ >>) e es ;;
let str_item_of_stmt = function | Def (v,e,_loc) -> <:str_item< let $lid:v$ = $expr_of_expr e$ ;; >> | Print (es,_loc) -> let es = List.map expr_of_expr es in <:str_item< print_endline $concat_exprs _loc es$ ;; >> ;;
Notice that the locations stored in the AST nodes are captured using the name
_loc, which is the location used implicitly for the quotations. As such, the resulting OCaml AST nodes can be associated to the source location that caused it.
Next, we also define a grammar that can be used to parse our minimal language and produce an AST. We define two entry points (
stmt) and we extend these enty points with a rule for each possible case.
let expr = Gram.Entry.mk "expr" ;; let stmt = Gram.Entry.mk "stmt" ;;
EXTEND Gram expr: [ [ v = LIDENT -> Var (v,_loc) | s = STRING -> String (s,_loc) ] ]; stmt: [ [ "def"; v = LIDENT; "="; e = expr -> Def (v,e,_loc) | "print"; es = LIST1 expr SEP "," -> Print (es,_loc) ] ]; END ;;
We then have to integrate these elements. The grammar should be used to parse the input; the resulting AST should be translated to an OCaml AST and this OCaml AST should be outputted.
There are three ways of doing this integration:
Flushing the initial grammar but keeping the rule structure and names
If we keep the rule structure, we can use the camlp4 program as a driver. However, we have to override the OCaml grammar at an appropriate point to install an entirely different language. To find this point, we will do some inspection of the grammar from the toplevel of the main entry points: for an implementation (.ml file) this is
Syntax.implem, for an interface (.mli file) this is
Syntax.interf, and for code entered at the toplevel, this is
Syntax.top_phrase. Interfaces are beyond the scope of this tutorial, so we will focus on the other two:
$ ocaml camlp4o.cma # open Camlp4.PreCast ;; # Gram.Entry.print Format.std_formatter Syntax.implem ;; implem: [ LEFTA [ "#"; a_LIDENT; opt_expr; semi | EOI | str_item; semi; SELF ] ] - : unit = () # Gram.Entry.print Format.std_formatter Syntax.top_phrase ;; top_phrase: [ LEFTA [ EOI | phrase ] ] - : unit = () # Gram.Entry.print Format.std_formatter Syntax.phrase ;; phrase: [ LEFTA [ "#"; a_LIDENT; opt_expr; semi | str_item; semi ] ] - : unit = () # Gram.Entry.print Format.std_formatter Syntax.str_item ;; str_item: [ "top" LEFTA [ "let"; "module"; a_UIDENT; module_binding0; "in"; expr | "let"; opt_rec; binding; "in"; expr | "let"; opt_rec; binding ...
What this basically means, is that implementation is a sequence of structure items and directives (terminated by optional semicolons), and that a toplevel phrase is one structure item or directive (or the end-of-input). While directives might still be relevant in the context of our language, structure items are OCaml-specific. We therefore choose to override the grammar at the level of
str_item. This influences both toplevel and implementation code.
This is the most straightforward. We complete our code from above as follows:
Gram.Entry.clear Syntax.str_item ;; EXTEND Gram Syntax.str_item : [ [ s = stmt -> str_item_of_stmt s ] ]; END ;;
We first flush the existing definitions of
str_item and then install a new rule: we parse a statement and translate it to a structure item.
With functors (using Register.OCamlSyntaxExtension)
To do the same with functors instead of the
PreCast module, we have to start our definition differently. Instead of opening that module in the above code, we will write the following things.
First we import the camlp4 module types (such as
Camlp4Syntax etc.), and the camlp4 token constructors (such as
open Camlp4.Sig ;;
Next we define an
Id module for our syntax (which camlp4 still views as a syntax extension):
module Id = struct let name = "MinimalLanguage" ;; let version = "0.1" ;; end ;;
We can then open our syntax extension module. It is a functor that defines a
Camlp4Syntax in terms of a given
Camlp4Syntax. We start by opening the given syntax:
module Minimal (Syntax : Camlp4Syntax) = struct open Syntax;;
Syntax makes a number of modules available, most importantly
Gram. That means we can include the definition of the translation and the parser unmodified.
After those definitions, we can override the
str_item entry point as before. Since we already opened the
Syntax module, we do not need to include that module name in the references to
Gram.Entry.clear str_item ;; EXTEND Gram str_item : [ [ s = stmt -> str_item_of_stmt s ] ]; END ;;
Finally, we close our syntax extension module with:
include Syntax ;; end ;;
Since we include the original
Camlp4Syntax module at the end, our syntax extension functor defines the same syntax as the one that it took as an argument. Nonetheless, the grammar modification statements will be applied at functor application time. (We could have just included the
Syntax in the beginning of the definition of
Minimal, but then our definition of
expr would have overridden the existing
expr value and this would have clashed.)
Finally, we register this syntax extension and its
Id module. This occurs as a side-effect of a functor application, so we can do it as a local module definition that is never used.
let module M = Camlp4.Register.OCamlSyntaxExtension(Id)(Minimal) in () ;;
We can compile the syntax extension (defined either with or without functors) using
ocamlc -c -I +camlp4 -pp camlp4of minimal.ml
Afterwards, it can be used with the
camlp4o binary. For example, if
test.ml contains the example program given at the beginning of the article, we can see the generated code:
$ camlp4o minimal.cmo test.ml let name = "world" let _ = print_endline ("Hello" ^ (" " ^ name))
Or compile and run it:
$ ocamlc -o test -pp "camlp4o minimal.cmo" test.ml && ./test Hello world
camlp4o.cma, it can be used to create a toplevel for our language:
$ ocaml camlp4o.cma minimal.cmo # def name = "world" ;; val name : string = "world" # print "Hello", name ;; Hello world - : unit = () # 1+1 ;; Parse error: illegal begin of top_phrase
Notice that semantic errors detected by OCaml will be reported with appropriate source locations, for example, a program with the single line:
will produce the error:
$ ocamlc -c -pp "camlp4o minimal.cmo" foo.ml File "foo.ml", line 1, characters 6-9: Unbound value foo
Make a new grammar using the same lexer and token type
open Camlp4.PreCast;; module Gram = MakeGram(Lexer);; ...
Then you have to write a main function for your parser, camlp4 cannot do it for you.
Make a new lexer and token type and then make a new grammar