shake, with fries

by Zach Carter

Posts tagged language

Jan29

Jison, now with more Bison flavor.

javascript programming jison bison language parser lexer | comments

I pushed a new version of Jison (0.1.5) the other day that adds some interesting features, namely the ability to use Bison-like grammar definitions and Flex-like lexer definitions.

Bison

Bison flavored grammars

The following is a valid Bison grammar:

%token NUMBER

%%

E
    : E '+' T
    | T
    ;

T
    : NUMBER
    ;

Now it is also a valid Jison grammar. You can generate a parser using it as you would with the JSON format:

narwhal bin/jison examples/calculator.jison

This will generate just the parser, so you won't be able to parse anything with it just yet (we'll need a lexer too, which I detail later.) And as you can see, the file format ends with a .jison extension to distinguish it, as it is not fully compatible with Bison.

Jison will ignore %token declarations, as seen in the above grammar. It only recognizes %left, %right, %nonassoc, %start, and %prec declarations, which carry the same semantics as with Bison.

Semantic Actions

Semantic actions should look familiar:

/* description: Parses end executes mathematical expressions. */

%left '+' '-'
%left '*' '/'
%left '^'
%left UMINUS

%%

S
    : e EOF
        {print($1); return $1;}
    ;

e
    : e '+' e
        {$$ = $1+$3;}
    | e '-' e
        {$$ = $1-$3;}
    | e '*' e
        {$$ = $1*$3;}
    | e '/' e
        {$$ = $1/$3;}
    | e '^' e
        {$$ = Math.pow($1, $3);}
    | '-' e
        {$$ = -$2;} %prec UMINUS
    | '(' e ')'
        {$$ = $2;}
    | NUMBER
        {$$ = Number(yytext);}
    | E
        {$$ = Math.E;}
    | PI
        {$$ = Math.PI;}
    ;

This is valid for both Bison and Jison. The difference is that Jison does not attempt to match braces inside of actions like Bison does, so you'll have to wrap your action in a different set of braces if they contain a closing brace, e.g.:

nonterminal
    : TOKEN
        {{$$ = '}';}}

In this case we wrapped double braces around the action so it knows the single brace is not the end of the action.

The nicest thing about this format over JSON is that porting Bison grammars is much simpler, though you'll still have to remove all of the C code and duplicate the logic in a separate JavaScript module. Jison also doesn't handle semantic value data types (not really necessary with a dynamically typed language like JavaScript,) mid-rule actions and error recovery, yet, among other things. Feel free to post an issue requesting your favorite missing feature.

A bit of an aside, but I found the process of bootstrapping this really interesting. You end up with a grammar that can parse itself.

Named Semantic Values

I decided to add some sugar for accessing semantic values within actions by name instead of position. Normally, you'd have to use the position of the corresponding nonterminal in the production, prefixed by a dollar sign $, e.g.:

 exp:    ...
         | '(' exp ')'
             { $$ = $2; }

Now, you can also access the value by using the name of the nonterminal instead of its position, e.g.:

 exp:    ...
         | '(' exp ')'
             { $$ = $exp; }

If the rule is ambiguous (the nonterminal appears more than once,) append a number to the end of the nonterminal name to disambiguate the desired value:

 exp:    ...
         | exp '+' exp
             { $$ = $exp1 + $exp2; }

Association by name leads to a looser coupling (and is easier to grok.)

Defining Lexers

Note that the new grammar format does not allow embedded lexer definitions. This is because there's a new, separate format for defining lexers, which also serves to keep things more modular.

The format has a more restricted form of regular expression syntax, but no less expressive. The main difference is that strings that should match exactly must be wrapped in quotes. Here's an example of the format:

%%
\s+                   {/* skip whitespace */}
[0-9]+("."[0-9]+)?\b  {return 'NUMBER';}
"*"                   {return '*';}
"/"                   {return '/';}
"-"                   {return '-';}
"+"                   {return '+';}
"^"                   {return '^';}
"("                   {return '(';}
")"                   {return ')';}
"PI"                  {return 'PI';}
"E"                   {return 'E';}
<<EOF>>               {return 'EOF';}

You can also use macros:

D                     [0-9]

%%
\s+                   {/* skip whitespace */}
{D}+("."{D}+)?\b      {return 'NUMBER';}
"*"                   {return '*';}
"/"                   {return '/';}
"-"                   {return '-';}
"+"                   {return '+';}
"^"                   {return '^';}
"("                   {return '(';}
")"                   {return ')';}
"PI"                  {return 'PI';}
"E"                   {return 'E';}
<<EOF>>               {return 'EOF';}

The format ultimately compiles to JavaScript regular expressions, so use the appropriate escape sequences and such. The file extension for this format is .jisonlex.

JSON-to-Jison

The update includes a utility for converting grammars from the old JSON format to the new Jison format. Just run your JSON file through json2jison to perform the conversion, e.g.:

narwhal bin/json2jison examples/basic.json

This will create a file called basic.jison in the current directory, which should be ready to run through Jison as-is.

If the JSON grammar has an embedded lex specification, json2jison will create a .jisonlex file also, though it will likely need correction (the conversion is lossy, to say the least.) E.g, running:

narwhal bin/json2jison examples/calculator.json

will create calculator.jison and calculator.jisonlex. Compare the generated calculator.jisonlex to the correct one found in examples/calculator.jisonlex to see the difference. Once the syntax of the lex specification is corrected, you can generate a complete parser like so:

narwhal bin/jison calculator.jison calculator.jisonlex

passing both the grammar and lexer specs.

Node Compatible

I should note that Jison is now compatible with node.js when included using require from a module. The examples from the README should work on any CommonJS system that implements the Module specification.

Tim Caswell used Jison as part of his CoffeeScript compiler that runs on Node.

In the future

There's tons more to implement from Bison, and the grammar should also become more compatible over time, in terms of syntax and features that are supported by Jison.

I also have many topics to cover on using Jison, which I hope to write about more.

Until then, stay parsey, my friends.

Continue reading »

Dec28

Jison: An API for creating parsers in JavaScript

javascript programming jison bison language parser lexer | comments

Bottom-up parsing

When you hear of someone writing a parser, it is usually of the top-down, recursive descent variety. Visualizing the source as an abstract syntax tree (AST), this means the parser tries to match the root, and by doing so, may recursively try to match branches and stems of the AST as it works its way down.

Bottom-up parsing is different:

Bottom-up parsing (also known as shift-reduce parsing) is a strategy for analyzing unknown data relationships that attempts to identify the most fundamental units first, and then to infer higher-order structures from them. It attempts to build trees upward toward the start symbol. It occurs in the analysis of both natural languages and computer languages. - Wikipedia

Because most bottom-up parsers require a parse table (a complex matrix of instructions for the parser PDA) they are usually not handwritten. Instead, a parser generator such as Yacc, Bison, ocamlyacc, etc are used to generate the parser in the target language. For Bison and Yacc, the language is C, for ocamlyacc, the language is, well, ocaml.

Jison

Jison generates bottom-up parsers in JavaScript. Its API is similar to Bison's, hence the name. It supports many of Bison's major features, plus some of its own. If you are new to parser generators such as Bison, and Context-free Grammars in general, a good introduction is found in the Bison manual. If you already know Bison, Jison should be easy to pickup.

A brief warning before proceeding: the API is ridiculously unstable right now. The goal is to mirror Bison where it makes sense, but we're not even there yet. Also, optimization has not been a main focus as of yet.

Briefly, Jison takes a JSON encoded grammar specification and outputs a JavaScript file capable of parsing the language described by that grammar specification. You can then use the generated script to parse inputs and accept, reject, or perform actions based on the input.

Installation

Prerequisite: To run Jison from the command line, you'll need to have Narwhal installed and available from your PATH.

Clone the github repository:

git clone git://github.com/zaach/jison.git

Using Jison from the command line

Now you're ready to generate some parsers:

cd jison
narwhal bin/jison examples/calculator.json

This will generate a calculator.js file in your current working directory. This file can be used to parse an input file, like so:

echo "2^32 / 1024" > testcalc
narwhal calculator.js testcalc

This will print out 4194304.

Using Jison from a CommonJS module

You can generate parsers programatically from JavaScript as well. Assuming Jison is in your commonjs environment's load path:

// mygenerator.js
var Parser = require("jison").Parser;

var grammar = {
    "lex": {
        "rules": [
           ["\\s+", "/* skip whitespace */"],
           ["[a-f0-9]+", "return 'HEX';"]
        ]
    },

    "bnf": {
        "hex_strings" :[ "hex_strings HEX",
                         "HEX" ]
    }
};

var parser = new Parser(grammar);

// generate source, ready to be written to disk
var parserSource = parser.generate();

// you can also use the parser directly from memory

// returns true
parser.parse("adfe34bc e82a");

// throws lexical error
parser.parse("adfe34bc zxg");

Using the generated parser

So, you have generated your parser through the command line or JavaScript API and have saved it to disk. Now it can be put to use.

As demonstrated before, the parser can be used from the command line:

narwhal calculator.js testcalc

Though, more ideally, the parser will be a dependency of another module. You can require it from another module like so:

// mymodule.js
var parser = require("./calculator").parser;

function exec (input) {
    return parser.parse(input);
}

var twenty = exec("4 * 5");

Or, more succinctly:

// mymodule.js
function exec (input) {
    return require("./calculator").parse(input);
}

var twenty = exec("4 * 5");

Specifying a language

The process of parsing a language involves two phases: lexical analysis (tokenizing) and parsing, which the Lex/Yacc and Flex/Bison combinations are famous for. Both lexing and parsing information can be included in the Jison grammar specification for a language.

For example, here is calculator.json:

{
    "comment": "Parses end executes mathematical expressions.",

    "lex": {
        "rules": [
           ["\\s+",                    "/* skip whitespace */"],
           ["[0-9]+(?:\\.[0-9]+)?\\b", "return 'NUMBER';"],
           ["\\*",                     "return '*';"],
           ["\\/",                     "return '/';"],
           ["-",                       "return '-';"],
           ["\\+",                     "return '+';"],
           ["\\^",                     "return '^';"],
           ["\\(",                     "return '(';"],
           ["\\)",                     "return ')';"],
           ["PI\\b",                   "return 'PI';"],
           ["E\\b",                    "return 'E';"],
           ["$",                       "return 'EOF';"]
        ]
    },

    "operators": [
        ["left", "+", "-"],
        ["left", "*", "/"],
        ["left", "^"],
        ["left", "UMINUS"]
    ],

    "bnf": {
        "S" :[[ "e EOF",   "print($1); return $1;"  ]],

        "e" :[[ "e + e",   "$$ = $1+$3;" ],
              [ "e - e",   "$$ = $1-$3;" ],
              [ "e * e",   "$$ = $1*$3;" ],
              [ "e / e",   "$$ = $1/$3;" ],
              [ "e ^ e",   "$$ = Math.pow($1, $3);" ],
              [ "- e",     "$$ = -$2;", {"prec": "UMINUS"} ],
              [ "( e )",   "$$ = $2;" ],
              [ "NUMBER",  "$$ = Number(yytext);" ],
              [ "E",       "$$ = Math.E;" ],
              [ "PI",      "$$ = Math.PI;" ]]
    }
}

This is all that is needed for Jison to generate a parser capable of recognizing the language and executing the semantic actions.

More examples can be found in the examples/ directory.

JSON isn't so bad, but a more author friendly format is in the works. It will be similar to Bison's .y files.

Sharing scope

In Bison, code is expected to be lexically defined within the scope of the semantic actions. E.g., chunks of code may be included in the generated parser source, which are available from semantic actions.

Jison is more modular. Instead of pulling code into the generated module, the generated module is expected to be required and used by other modules. This means that if you want to expose functionality to the semantic actions, you can't rely on it being available through lexical scoping. Instead, the parser has a yy property which is exposed to actions as the yy free variable. Any functionality attached to this property is available in both lexical and semantic actions through the yy free variable.

An example from orderly.js:

var parser = require("./orderly/parse").parser;

// set parser's shared scope
parser.yy = require("./orderly/scope");

// returns the JSON object
var parse = exports.parse = function (input) {
    return parser.parse(input);
};
...

The scope module contains logic for building data structures, which is used within the semantic actions.

Lexical Analysis

Jison includes a rather rudimentary lexer, though any module that supports the basic lexer API could be used in its place. Jison's lexer uses the lex key of the JSON grammar spec, where the rules for matching a token are defined along with the action to execute on a match. Usually, the action will return the token which is used by the Jison parser. A custom lexer could be used instead with it's own methods of tokenizing.

TODO: More on this.

Parsing algorithms

Like Bison, Jison can recognize languages described by LALR(1) grammars, though it also has modes for LR(0), SLR(1), and LR(1). It also has a special mode for generating LL(1) parse tables (requested by my professor,) and could be extended to generate a recursive descent parser for LL(k) languages in the future. But, for now, Jison is geared toward bottom-up parsing.

*LR(1) mode is currently not practical for use with anything other than toy grammars, but that is entirely a consequence of the algorithm used, and may change in the future.

Real world example

I wrote a parser for Orderly using Jison. The benefits I found were:

  • If modeled after the normative language grammar, it is guaranteed to recognize the correct language.
  • Adding new syntax is straight forward.
  • It was much faster to develop than if I were to attempt implementing a (top-down) parser from scratch. But for others not used to grammar specifications, this might not be the case.
  • Improvements to Jison mean you only have to regenerate your parsing module to see the benefits.

Pie-in-the-sky ideas for the future

  • Practical, minimal LR(1) parsing
  • GLR parsing, which can recognize ambiguous grammars and natural languages
Continue reading »