This is a PEG implementation in JavaScript, which produces pure JavaScript packrat parsers from a PEG description of a grammar.
There is an API for compiling a parser from a grammar, a very simple API for using a parser, and some helpful convenience functions for dealing with the parse trees that a parser produces.
Grammars are given to the parser generator as a parser expression grammar or PEG. The PEG language accepted by the parser generator will be familiar to anyone with a familiarity with PEGs.
The arithmetic grammar from the demo makes a good example of the basic features.
Expr ← Add
Add ← Mult ( S? "+" S? Mult )*
Mult ← Num ( S? "*" S? Num )*
Num ← "0" / [1-9] [0-9]*
S ← [ U+0020 ]+
PEGs are similar to CFGs, but have a more imperative nature: while a CFG describes a language, which may or may not be easy to parse, and may be ambiguous, a PEG describes a deterministic parser. Unextended PEGs cannot describe ambiguous grammars, which makes them an excellent choice for parsing programming languages. Wikipedia's article on PEGs is a good introduction.
TODO: complete documentation of accepted PEG formalism
The following API is used to produce a parser from a PEG. This API is demonstrated in the arithmetic parser demo. This API and all its dependencies are provided in the file PEG_generator.js.
generateParser(peg,opts)Returns either a compiled parser, which is a pure JavaScript program with no dependencies, or an error.
The return value will be an array, either [true, generateParserThrowing() function which, instead of returning an array, either returns a parser or throws an exception.
Currently the returned errors are not very informative.
The most common reason for failure is that the PEG grammar given as an argument could not be parsed.
The opts argument is an object the properties of which control the code that is generated.
The opts argument is optional and if provided, all the properties are optional.
The defined properties are:
drop
A list of parse rule names to drop from the generated parse tree. Any node generated by such a parse, and all its children, will not appear in the tree. This does not change the way input text is parsed, only the parse tree that is returned. This is commonly used when parsing programming languages to exclude low-level or atomic rules such as FunctionToken or SourceCharacter which are not interesting. By default, the full parse tree is returned.
nocache
For efficiency it is sometimes better to not cache successful or failing match results in the result table. This is a set of rule names that should not be cached. This does not affect the returned parse tree, only the efficiency of the parser. By default all rules are cached in the result table.
prefix
The returned parser will be a function, the name of which will be the start token of the grammar, prefixed with this option. By default no prefix is used.
start
The start rule may be given as an argument. By default the first parse rule in the PEG grammar is taken to be the start rule.
debug
This can be set to generate a parser with some debugging annotations, the details of which are undocumented.
profile
If true, a profiling parser is generated.
Instead of returning a parse tree, it goes through the motions of parsing the input text and records all matches, cache hits and misses per rule name, and returns a table of profiling data.
If suitably representative input is chosen, this profile can be used to heuristically determine a set of rules that should not be cached, which can then be given as the nocache option when generating the actual parser.
The API do to this profile analysis is not yet exported.
Once a parser is generated using the API above, it can be used to parse input.
This and the API above are used together in the arithmetic parsing demo. An example with a more complex grammar is provided by the ECMAScript 5 demo which parses a JavaScript program according to the formal language grammar.
The parsers that are generated by the API above consist of a single function, and have no external dependencies so to use it you include the parser code somewhere in your project and call the function with the input to be parsed.
A generated parser is a function which takes a string as an argument and returns a parse tree or an error.
The return value is an array, either of the form [true,
The result tree format is designed to be efficient in terms of memory. When parsing large inputs with a complex grammar, such as when parsing a moderately sized JavaScript library using the ES5 parser, the generated parse tree can be quite a bit larger in memory than the input string. An API to return parse trees as a stream is planned for the next release.
Each node in the parse tree is represented as an array with three elements. The first element is the index of the parse rule that matched to create this node. If the node is anonymous, this will be -1. These indices can be looked up in the original grammar and correspond to the position in the grammar of each rule, for example if the start rule is the first rule given in the PEG, it will have index 0. The second element is the length of the input that was matched by this node. To determine the position of a node it is necessary to accumulating the matched lengths while descending into the parse tree. The third and final element of each parse tree node is an array of child nodes.
The following example shows a short JavaScript program, the raw parse tree returned by the ES5 parser, and a pretty-printed representation of that tree (generated by the showTree function described below). The first node in this tree, beginning "[0, 23, [..." is the match of the entire text against the Program rule.
function f(){return 42}
[0, 23, [[2, 23, [[-1, 8], [124, 1, []], [132, 1, [[133, 1, []]]], [-1, 3], [1, 9, [[6, 9, [[22, 9, [[-1, 6], [125, 1, []], [33, 2, [[35, 2, [[38, 2, [[40, 2, [[42, 2, [[44, 2, [[46, 2, [[48, 2, [[50, 2, [[53, 2, [[57, 2, [[59, 2, [[61, 2, [[63, 2, [[64, 2, [[65, 2, [[69, 2, [[70, 2, [[71, 2, [[80, 2, [[83, 2, [[84, 2, [[85, 2, [[-1, 1], [86, 1, []]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]], [126, 0, []]]]]]]]]]]]
Program 0-23
FunctionDeclaration 0-23
anonymous 0-8
S 8-9
Identifier 9-10
IdentifierName 9-10
anonymous 10-13
FunctionBody 13-22
Statement 13-22
ReturnStatement 13-22
anonymous 13-19
SnoLB 19-20
Expr 20-22
AssignmentExpr 20-22
ConditionalExpr 20-22
LogicalOrExpr 20-22
LogicalAndExpr 20-22
BitwiseOrExpr 20-22
BitwiseXOrExpr 20-22
BitwiseAndExpr 20-22
EqualityExpr 20-22
RelationalExpr 20-22
ShiftExpr 20-22
AdditiveExpr 20-22
MultiplicativeExpr 20-22
UnaryExpr 20-22
PostfixExpr 20-22
LeftHandSideExpr 20-22
NewExpr 20-22
MemberExpr 20-22
PrimaryExpr 20-22
Literal 20-22
NumericLiteral 20-22
DecimalLiteral 20-22
DecimalIntegerLiteral 20-22
anonymous 20-21
DecimalDigit 21-22
EOS 22-22
Currently the only exported function is showTree.
It takes two arguments, a parse tree, and a rule name array, and returns a pretty-printed ASCII representation of the parse tree like the one shown above.
The name array is used to correlate the parse rule indices with rule names, and it is provided in generated parsers as a names property of the parser function.