Algorithms That Govern Our Lives

Turntable.fm defies this trend, and that’s why it’s so good. A minor victory against our algorithmic overlords.

(Source: video.nextconf.eu)

Deal with it.

npm install jison

Tinkering with new projects should be easy. You should be able to click a link or invoke a command and BAM, there goes your Sunday afternoon. I’m not sure how many people build parsers on the weekend, but hey, Jison is easy to install now anyway.

If you’re using Node.js you should already have the excellent npm installed. If not, that’s easy:

curl -L http://github.com/isaacs/npm/tarball/master | tar xz --strip 1
cd npm
sudo make

Now install jison:

sudo npm install jison

Crazy easy. Now go read the docs for challenges.

Lambda Calculus Evaluator

After writing Jison, I began more experimentation with other facets of programming language design, namely interpreters. I was first exposed to their workings two years ago in a programming languages course at my university, where we wrote an interpreter for a minimal functional language using SML. As an experiment, I used the same technique to evaluate lambda calculus expressions.

Well, that was fun and all, but since I’m doing most of my language experimentation in JavaScript these days, I decided to port the lambda calculus evaluator to JavaScript (and put up a web interface.) It uses call-by-value evaluation semantics, as did the SML original. I used Jison for parsing, naturally.

Sure, it’s Turing Complete, but I doubt you would want to (or could) code your next blogging engine using it.

Anyway, have fun!

Replication Speed: CouchDB vs. MongoDB

I’m posting this because I hadn’t seen this comparison done before. Therefore, it might be useful to someone else. Maybe.

This was a rough test to see how long a slave would take to catch up with its master, which was sustaining about 500 document insertions per second with about 5 million documents already stored. The general verdict was that MongoDB replication is much faster, at least for this particular data model (i.e. logging data.) For CouchDB, inserting millions of small documents is probably not the most efficient way to do things.

I set up Ubuntu 9.10 VMs with 2 GB of RAM each for the master and two slaves. The first slave was set to replicate with the master from the beginning, to see if it ever lags (it was negligible for both DBMSs.) The other slave was used to see how long replication took to catch up.

CouchDB (v0.11)

For CouchDB, after 5,015,630 documents had been inserted over the course of 2 hours and 48 minutes, I started continuous replication on the third node and waited for it to catch up.

At 9,266,860 documents total, the third node had finally caught up with the master, after another 2 hours and 47 minutes, almost the same amount of time it had been offline.

For shorter intervals, the rate of replication was better. For example, it took 16 minutes for an empty database to catch up to a master that had been running for 30 minutes sustaining 500/document insertions per second.

MongoDB (v1.4)

MongoDB took considerably less time, completing the sync of 5,015,639 documents in under 7 minutes. Roughly half of the time was spent creating the new databases on disk and filling them with zeros, and the other half applying document inserts.

If shutdown improperly the mongod node needs to be repaired (with mongod --repair), which took around 11 minutes for 5 million records, ~4 GB of data. There really isn’t an improper way to shutdown CouchDB, so you can boot and start replicating immediately.

Conclusion

We’ll probably use MongoDB for dumping log data. It’s simple enough and the data is not exactly mission critical, so MongoDB’s shortcomings (durability, master-master limitations) aren’t enough of a deterrent.

I really like CouchDB’s replication system. It reminds me of git and allows for the same fully distributed model, but our cluster is not large or diverse enough to really take advantage of those features. I would have picked CouchDB if not for the dramatic difference in replication speed. But hey, no one DBMS is a hammer for all nails.

Extra Notes on Size

Documents were all clones of this JSON:

{"play_time":"19:02:20","file_name":"8821_20100224-1100.xml","playlist":"none","event":"playlist","player_id":9861,"play_date":"2010-03-02"}

CouchDB

  • Master: 5,015,630 docs, 5.0 GB
  • Slave: 5,015,630 docs, 9.3 GB

CouchDB after sync completed (compacted)

  • Master: 9,266,860 docs, 7.4 GB (3.6 GB)
  • Slave: 9,266,860 docs, 18.6 GB (3.6 GB)
  • Slave2: 9,266,860 docs, 13.9 GB (3.6 GB)

MongoDB was the same size on all nodes, or just under 4 GB each after 5 million+ documents (as you may know, MongoDB adds new database chunks up to 2 GB as needed.)

Slides and code from “Build Your Own Programming Language with JavaScript”

I gave a Track B presentation this year at JSConf titled “Build Your Own Programming Language with JavaScript”. The conference was incredible, but I’ll post about that later, perhaps. I’ve pushed the code and slides from the talk to github for people to play with.

A Common, Universal Scripting Language

Where is it? Hint: JavaScript. There’s an interpreter on every consumer platform and it’s already the most widely used scripting language in the world[citation needed]. The core language is also very simple, and beginner friendly by design. Heck, every programmer already knows about half of the language through dealing with JSON. Throw in functions and you can get by.

It may not be the ideal scripting language, but it is the closest we have to a universal one.

Localized JSON Lint

I stumbled upon jsonlint.com recently, a service that sends your JSON data to a server for validation and returns the results. I decided to create a pure JavaScript version, which makes it is easy to host your self. It also comes as a CommonJS module in case you want to include it in some other tool.

CoffeeScript compiler now uses Jison

The new CoffeeScript self compiler uses Jison for generating the parser. The whole build script runs on Node.

Jeremy, the creator of CoffeeScript, spurred me to make some quick performance improvements (and bug fixes) to Jison so that the CoffeeScript grammar could build in a timely fashion. And while there are still many improvements to be made, handling such a large and complex grammar is quite a feat!

Jeremy also reported that parsing times are quick, though that’s more of a testament to v8, I’m sure.

Jison, now with more Bison flavor.

I pushed a new version of Jison (0.1.5) the other day that adds some interesting features, namely the ability to use Bison-like grammar definitions and Flex-like lexer definitions.

Bison

Bison flavored grammars

The following is a valid Bison grammar:

%token NUMBER

%%

E
    : E '+' T
    | T
    ;

T
    : NUMBER
    ;

Now it is also a valid Jison grammar. You can generate a parser using it as you would with the JSON format:

narwhal bin/jison examples/calculator.jison

This will generate just the parser, so you won’t be able to parse anything with it just yet (we’ll need a lexer too, which I detail later.) And as you can see, the file format ends with a .jison extension to distinguish it, as it is not fully compatible with Bison.

Jison will ignore %token declarations, as seen in the above grammar. It only recognizes %left, %right, %nonassoc, %start, and %prec declarations, which carry the same semantics as with Bison.

Semantic Actions

Semantic actions should look familiar:

/* description: Parses end executes mathematical expressions. */

%left '+' '-'
%left '*' '/'
%left '^'
%left UMINUS

%%

S
    : e EOF
        {print($1); return $1;}
    ;

e
    : e '+' e
        {$$ = $1+$3;}
    | e '-' e
        {$$ = $1-$3;}
    | e '*' e
        {$$ = $1*$3;}
    | e '/' e
        {$$ = $1/$3;}
    | e '^' e
        {$$ = Math.pow($1, $3);}
    | '-' e
        {$$ = -$2;} %prec UMINUS
    | '(' e ')'
        {$$ = $2;}
    | NUMBER
        {$$ = Number(yytext);}
    | E
        {$$ = Math.E;}
    | PI
        {$$ = Math.PI;}
    ;

This is valid for both Bison and Jison. The difference is that Jison does not attempt to match braces inside of actions like Bison does, so you’ll have to wrap your action in a different set of braces if they contain a closing brace, e.g.:

nonterminal
    : TOKEN
        {{$$ = '}';}}

In this case we wrapped double braces around the action so it knows the single brace is not the end of the action.

The nicest thing about this format over JSON is that porting Bison grammars is much simpler, though you’ll still have to remove all of the C code and duplicate the logic in a separate JavaScript module. Jison also doesn’t handle semantic value data types (not really necessary with a dynamically typed language like JavaScript,) mid-rule actions and error recovery, yet, among other things. Feel free to post an issue requesting your favorite missing feature.

A bit of an aside, but I found the process of bootstrapping this really interesting. You end up with a grammar that can parse itself.

Named Semantic Values

I decided to add some sugar for accessing semantic values within actions by name instead of position. Normally, you’d have to use the position of the corresponding nonterminal in the production, prefixed by a dollar sign $, e.g.:

 exp:    ...
         | '(' exp ')'
             { $$ = $2; }

Now, you can also access the value by using the name of the nonterminal instead of its position, e.g.:

 exp:    ...
         | '(' exp ')'
             { $$ = $exp; }

If the rule is ambiguous (the nonterminal appears more than once,) append a number to the end of the nonterminal name to disambiguate the desired value:

 exp:    ...
         | exp '+' exp
             { $$ = $exp1 + $exp2; }

Association by name leads to a looser coupling (and is easier to grok.)

Defining Lexers

Note that the new grammar format does not allow embedded lexer definitions. This is because there’s a new, separate format for defining lexers, which also serves to keep things more modular.

The format has a more restricted form of regular expression syntax, but no less expressive. The main difference is that strings that should match exactly must be wrapped in quotes. Here’s an example of the format:

%%
\s+                   {/* skip whitespace */}
[0-9]+("."[0-9]+)?\b  {return 'NUMBER';}
"*"                   {return '*';}
"/"                   {return '/';}
"-"                   {return '-';}
"+"                   {return '+';}
"^"                   {return '^';}
"("                   {return '(';}
")"                   {return ')';}
"PI"                  {return 'PI';}
"E"                   {return 'E';}
<<EOF>>               {return 'EOF';}

You can also use macros:

D                     [0-9]

%%
\s+                   {/* skip whitespace */}
{D}+("."{D}+)?\b      {return 'NUMBER';}
"*"                   {return '*';}
"/"                   {return '/';}
"-"                   {return '-';}
"+"                   {return '+';}
"^"                   {return '^';}
"("                   {return '(';}
")"                   {return ')';}
"PI"                  {return 'PI';}
"E"                   {return 'E';}
<<EOF>>               {return 'EOF';}

The format ultimately compiles to JavaScript regular expressions, so use the appropriate escape sequences and such. The file extension for this format is .jisonlex.

JSON-to-Jison

The update includes a utility for converting grammars from the old JSON format to the new Jison format. Just run your JSON file through json2jison to perform the conversion, e.g.:

narwhal bin/json2jison examples/basic.json

This will create a file called basic.jison in the current directory, which should be ready to run through Jison as-is.

If the JSON grammar has an embedded lex specification, json2jison will create a .jisonlex file also, though it will likely need correction (the conversion is lossy, to say the least.) E.g, running:

narwhal bin/json2jison examples/calculator.json

will create calculator.jison and calculator.jisonlex. Compare the generated calculator.jisonlex to the correct one found in examples/calculator.jisonlex to see the difference. Once the syntax of the lex specification is corrected, you can generate a complete parser like so:

narwhal bin/jison calculator.jison calculator.jisonlex

passing both the grammar and lexer specs.

Node Compatible

I should note that Jison is now compatible with node.js when included using require from a module. The examples from the README should work on any CommonJS system that implements the Module specification.

Tim Caswell used Jison as part of his CoffeeScript compiler that runs on Node.

In the future

There’s tons more to implement from Bison, and the grammar should also become more compatible over time, in terms of syntax and features that are supported by Jison.

I also have many topics to cover on using Jison, which I hope to write about more.

Until then, stay parsey, my friends.