Jison: An API for creating parsers in JavaScript

Bottom-up parsing

When you hear of someone writing a parser, it is usually of the top-down, recursive descent variety. Visualizing the source as an abstract syntax tree (AST), this means the parser tries to match the root, and by doing so, may recursively try to match branches and stems of the AST as it works its way down.

Bottom-up parsing is different:

Bottom-up parsing (also known as shift-reduce parsing) is a strategy for analyzing unknown data relationships that attempts to identify the most fundamental units first, and then to infer higher-order structures from them. It attempts to build trees upward toward the start symbol. It occurs in the analysis of both natural languages and computer languages. - Wikipedia

Because most bottom-up parsers require a parse table (a complex matrix of instructions for the parser PDA) they are usually not handwritten. Instead, a parser generator such as Yacc, Bison, ocamlyacc, etc are used to generate the parser in the target language. For Bison and Yacc, the language is C, for ocamlyacc, the language is, well, ocaml.

Jison

Jison generates bottom-up parsers in JavaScript. Its API is similar to Bison’s, hence the name. It supports many of Bison’s major features, plus some of its own. If you are new to parser generators such as Bison, and Context-free Grammars in general, a good introduction is found in the Bison manual. If you already know Bison, Jison should be easy to pickup.

A brief warning before proceeding: the API is ridiculously unstable right now. The goal is to mirror Bison where it makes sense, but we’re not even there yet. Also, optimization has not been a main focus as of yet.

Briefly, Jison takes a JSON encoded grammar specification and outputs a JavaScript file capable of parsing the language described by that grammar specification. You can then use the generated script to parse inputs and accept, reject, or perform actions based on the input.

Installation

Prerequisite: To run Jison from the command line, you’ll need to have Narwhal installed and available from your PATH.

Clone the github repository:

git clone git://github.com/zaach/jison.git

Using Jison from the command line

Now you’re ready to generate some parsers:

cd jison
narwhal bin/jison examples/calculator.json

This will generate a calculator.js file in your current working directory. This file can be used to parse an input file, like so:

echo "2^32 / 1024" > testcalc
narwhal calculator.js testcalc

This will print out 4194304.

Using Jison from a CommonJS module

You can generate parsers programatically from JavaScript as well. Assuming Jison is in your commonjs environment’s load path:

// mygenerator.js
var Parser = require("jison").Parser;

var grammar = {
    "lex": {
        "rules": [
           ["\\s+", "/* skip whitespace */"],
           ["[a-f0-9]+", "return 'HEX';"]
        ]
    },

    "bnf": {
        "hex_strings" :[ "hex_strings HEX",
                         "HEX" ]
    }
};

var parser = new Parser(grammar);

// generate source, ready to be written to disk
var parserSource = parser.generate();

// you can also use the parser directly from memory

// returns true
parser.parse("adfe34bc e82a");

// throws lexical error
parser.parse("adfe34bc zxg");

Using the generated parser

So, you have generated your parser through the command line or JavaScript API and have saved it to disk. Now it can be put to use.

As demonstrated before, the parser can be used from the command line:

narwhal calculator.js testcalc

Though, more ideally, the parser will be a dependency of another module. You can require it from another module like so:

// mymodule.js
var parser = require("./calculator").parser;

function exec (input) {
    return parser.parse(input);
}

var twenty = exec("4 * 5");

Or, more succinctly:

// mymodule.js
function exec (input) {
    return require("./calculator").parse(input);
}

var twenty = exec("4 * 5");

Specifying a language

The process of parsing a language involves two phases: lexical analysis (tokenizing) and parsing, which the Lex/Yacc and Flex/Bison combinations are famous for. Both lexing and parsing information can be included in the Jison grammar specification for a language.

For example, here is calculator.json:

{
    "comment": "Parses end executes mathematical expressions.",

    "lex": {
        "rules": [
           ["\\s+",                    "/* skip whitespace */"],
           ["[0-9]+(?:\\.[0-9]+)?\\b", "return 'NUMBER';"],
           ["\\*",                     "return '*';"],
           ["\\/",                     "return '/';"],
           ["-",                       "return '-';"],
           ["\\+",                     "return '+';"],
           ["\\^",                     "return '^';"],
           ["\\(",                     "return '(';"],
           ["\\)",                     "return ')';"],
           ["PI\\b",                   "return 'PI';"],
           ["E\\b",                    "return 'E';"],
           ["$",                       "return 'EOF';"]
        ]
    },

    "operators": [
        ["left", "+", "-"],
        ["left", "*", "/"],
        ["left", "^"],
        ["left", "UMINUS"]
    ],

    "bnf": {
        "S" :[[ "e EOF",   "print($1); return $1;"  ]],

        "e" :[[ "e + e",   "$$ = $1+$3;" ],
              [ "e - e",   "$$ = $1-$3;" ],
              [ "e * e",   "$$ = $1*$3;" ],
              [ "e / e",   "$$ = $1/$3;" ],
              [ "e ^ e",   "$$ = Math.pow($1, $3);" ],
              [ "- e",     "$$ = -$2;", {"prec": "UMINUS"} ],
              [ "( e )",   "$$ = $2;" ],
              [ "NUMBER",  "$$ = Number(yytext);" ],
              [ "E",       "$$ = Math.E;" ],
              [ "PI",      "$$ = Math.PI;" ]]
    }
}

This is all that is needed for Jison to generate a parser capable of recognizing the language and executing the semantic actions.

More examples can be found in the examples/ directory.

JSON isn’t so bad, but a more author friendly format is in the works. It will be similar to Bison’s .y files.

Sharing scope

In Bison, code is expected to be lexically defined within the scope of the semantic actions. E.g., chunks of code may be included in the generated parser source, which are available from semantic actions.

Jison is more modular. Instead of pulling code into the generated module, the generated module is expected to be required and used by other modules. This means that if you want to expose functionality to the semantic actions, you can’t rely on it being available through lexical scoping. Instead, the parser has a yy property which is exposed to actions as the yy free variable. Any functionality attached to this property is available in both lexical and semantic actions through the yy free variable.

An example from orderly.js:

var parser = require("./orderly/parse").parser;

// set parser's shared scope
parser.yy = require("./orderly/scope");

// returns the JSON object
var parse = exports.parse = function (input) {
    return parser.parse(input);
};
...

The scope module contains logic for building data structures, which is used within the semantic actions.

Lexical Analysis

Jison includes a rather rudimentary lexer, though any module that supports the basic lexer API could be used in its place. Jison’s lexer uses the lex key of the JSON grammar spec, where the rules for matching a token are defined along with the action to execute on a match. Usually, the action will return the token which is used by the Jison parser. A custom lexer could be used instead with it’s own methods of tokenizing.

TODO: More on this.

Parsing algorithms

Like Bison, Jison can recognize languages described by LALR(1) grammars, though it also has modes for LR(0), SLR(1), and LR(1). It also has a special mode for generating LL(1) parse tables (requested by my professor,) and could be extended to generate a recursive descent parser for LL(k) languages in the future. But, for now, Jison is geared toward bottom-up parsing.

*LR(1) mode is currently not practical for use with anything other than toy grammars, but that is entirely a consequence of the algorithm used, and may change in the future.

Real world example

I wrote a parser for Orderly using Jison. The benefits I found were:

  • If modeled after the normative language grammar, it is guaranteed to recognize the correct language.
  • Adding new syntax is straight forward.
  • It was much faster to develop than if I were to attempt implementing a (top-down) parser from scratch. But for others not used to grammar specifications, this might not be the case.
  • Improvements to Jison mean you only have to regenerate your parsing module to see the benefits.

Pie-in-the-sky ideas for the future

  • Practical, minimal LR(1) parsing
  • GLR parsing, which can recognize ambiguous grammars and natural languages

A Naive, JScript Compatible Getter Pattern

Modern JavaScript implementations let you define getters and setters for properties of objects. The new syntax involved will, of course, make your code incompatible with less capable JavaScript VMs if you chose to use it, but the behavior of the rest of your code will also be different, even when syntactically identical. This little experiment of mine was aimed to address the latter issue (well, at least for getters.)

The following is a naive implementation of getters that would be somewhat compatible with language supported getters in common usage (setters are just not possible!) Perhaps this technique could prove useful in some contexts but the reliance on type coercion makes it dubious…

Take, for example, the behavior of a rectangle:

var rect = new Rectangle(4,5);

rect.area; // returns 20

rect.width = 6;

rect.area; // returns 30

The area property updates dynamically, by way of a getter. With proper getters, you could implement the Rectangle class like so:

function Rectangle(width, height){
  this.width = width;
  this.height = height;

  this.__defineGetter__('area', function (){
    return this.width * this.height;
  });
}

Now let’s see the naive, JScript compatible approach:

function Rectangle(width, height){
  this.width = width;
  this.height = height;

  var that = this;

  var area = function(){};
  area.valueOf = function(){
    return that.width * that.height;
  };

  this.area = area;
}

The key idea here is that area is a complex value that will be coerced into a primitive by most operations, calling its valueOf method. Let’s see how well it works:

var rect = new Rectangle(4,5);

assert(typeof rect.area === 'function'); // it's really a function
assert(typeof +rect.area === 'number');  // but can be coerced into a number

assert(rect.area == 20);                 // type coercion
assert(rect.area !== 20);                // === and !== don't coerce type
assert(rect.area + 5 == 25);             // operator type coercion
assert(rect.area + 5 === 25);            // coercion happens before equality comparison

rect.width = 6;
assert(rect.area == 30);                 // area updates as expected

This is definitely a stretch of the intentions of the language, and as I mentioned, very dubious. But, if you could get away with it you would be able to utilize getter behavior on lesser JavaScript VMs without changing your API. Neat.

Edit: use valueOf, not toString. DUH

A Matter of Philosophy (and necessity)

The article comparing Mootools and jQuery is great — go ahead and take a gander. The author definitely did his research. In fact, there are only a few nitpicks I had with it.

The author demonstrates well his awareness of the different philosophies of each project. The aim of jQuery is to provide a succinct API for dealing with browser shortcomings and inconsistencies, whereas Mootools aims to be a framework for building web applications. This means Mootools includes things such as OOP patterns, and jQuery does not (and I don’t think that will change anytime soon.)

I happen to be a fan of small, modular pieces of functionality that can easily be plugged in when necessary, which is why the jQuery philosophy resonates more with me.*

Intermediate to Advanced Programmers

When the author mentions that intermediate to advanced JavaScript programmers would prefer Mootools, I believe he is speaking in the general sense, for there may also be cases where intermediate to advanced JavaScript programmers have their own methodologies to handle program structure — whether with OOP or something else. In such cases, jQuery might be more appealing precisely because it lacks the extras Mootools provides.

The author understands this point, no doubt, but I missed a more direct articulation of it, as someone in the situation described above.

I can also see how jQuery would appeal to designers and beginners who haven’t delt with, or had a need to deal with the complexities involved in larger application development. jQuery, with its succinct API, draws them in.

JavaScript Style

The author reiterates from time to time that Mootools has a more JavaScript-like style than jQuery. Well, that depends. JavaScript is a multi-paradigm language and depending on which paradigm you lean toward, that statement may prove true for you or it may not. If your style tends to be functional, you might find jQuery fits more snuggly alongside your braces and semicolons. I’ve gone through many stages as a JavaScript programmer, employing different styles, and I can’t say that one style or the other is more JavaScript like. Both prototypal or functional styles feel comfortable to me, though functional style JavaScript is probably less understood or used in general.

The Great Debate - Monkey patching

And then there is the great debate over extending global objects. This can be a deal breaker for some.

Mootools extends global objects to achieve its power and elegance. jQuery uses monadic like wrappers to achieve its power and elegance.

My personal view is that frameworks and third party modules should do as little modification to the global namespace as possible, so as not to conflict with the main application or other modules. If the application developer wants to modify the global namespace, that is their business, but third party code should play nice with others.

Extending global namespaces certainly feels good, but is it more powerful? Not necessarily; you could just as easily pass an object to a function rather than attach the function as a method to the object. But it certainly doesn’t look nearly as nifty.

If only there were a way to encapsulate changes to global objects to only within a certain scope…

That said, Mootools is intentionally exposing these language extensions as part of the framework it provides. If you choose Mootools then you don’t mind those extensions, or would have done similar extensions your self. As long as you are mindful of the potential for conflict, trouble should be avoidable.

Okay, I think I’ve gone on long enough. Great article, Aaron.

*And yet, I like Rails. People are weird…