bnf2turtle -- write a turtle version of an EBNF grammar

Submitted by connolly on Fri, 2006-02-10 01:11. :: |

In order to cross one of the few remaining t's on the SPARQL spec, I wrote bnf2turtle.py today. It turned out to be such a nice piece of code that I elaborated the module documentation using ReStructuredText. It's checked into the SPARQL spec editor's draft materials, but I'll probably move it to the swap codebase presently. Meanwhile, here's the formatted version of the documentation:

Author: Dan Connolly
Version: $Revision: 1.13 $ of 2006-02-10
Copyright: W3C Open Source License Share and enjoy.

Usage

Invoke a la:

python bnf2turtle.py foo.bnf >foo.ttl

where foo.bnf is full of lines like:

[1] document ::= prolog element Misc*

as per the XML formal grammar notation. The output is Turtle - Terse RDF Triple Language:

:document rdfs:label "document"; rdf:value "1";
 rdfs:comment "[1] document ::= prolog element Misc*";
 a g:NonTerminal;
  g:seq (
    :prolog
    :element
    [ g:star
      :Misc
     ]
   )
.

Motivation

Many specifications include grammars that look formal but are not actually checked, by machine, against test data sets. Debugging the grammar in the XML specification has been a long, tedious manual process. Only when the loop is closed between a fully formal grammar and a large test data set can we be confident that we have an accurate specification of a language [1].

The grammar in the N3 design note has evolved based on the original manual transcription into a python recursive-descent parser and subsequent development of test cases. Rather than maintain the grammar and the parser independently, our goal is to formalize the language syntax sufficiently to replace the manual implementation with one derived mechanically from the specification.

[1]and even then, only the syntax of the language.

Related Work

Sean Palmer's n3p announcement demonstrated the feasibility of the approach, though that work did not cover some aspects of N3.

In development of the SPARQL specification, Eric Prud'hommeaux developed Yacker, which converts EBNF syntax to perl and C and C++ yacc grammars. It includes an interactive facility for checking strings against the resulting grammars. Yosi Scharf used it in cwm Release 1.1.0rc1, which includes a SPAQRL parser that is almost completely mechanically generated.

The N3/turtle output from yacker is lower level than the EBNF notation from the XML specification; it has the ?, +, and * operators compiled down to pure context-free rules, obscuring the grammar structure. Since that transformation is straightforwardly expressed in semantic web rules (see bnf-rules.n3), it seems best to keep the RDF expression of the grammar in terms of the higher level EBNF constructs.

Open Issues and Future Work

The yacker output also has the terminals compiled to elaborate regular expressions. The best strategy for dealing with lexical tokens is not yet clear. Many tokens in SPARQL are case insensitive; this is not yet captured formally.

The schema for the EBNF vocabulary used here (g:seq, g:alt, ...) is not yet published; it should be aligned with swap/grammar/bnf and the bnf2html.n3 rules (and/or the style of linked XHTML grammar in the SPARQL and XML specificiations).

It would be interesting to corroborate the claim in the SPARQL spec that the grammar is LL(1) with a mechanical proof based on N3 rules.

Background

The N3 Primer by Tim Berners-Lee introduces RDF and the Semantic web using N3, a teaching and scribbling language. Turtle is a subset of N3 that maps directly to (and from) the standard XML syntax for RDF.

I started with a kludged and broken algorithm for handling the precedence of | vs concatenation in EBNF rules; for a moment I thought the task required a yacc-like LR parser, but then I realized recursive descent would work well enough. A dozen or so doctests later, it did indeed work. I haven't checked the resulting grammar against the SPARQL tests yet, but it sure looks right.

Then I wondered how much of the formal grammar notation from the XML spec I hadn't covered, so I tried it out on the XML grammar (after writing a 20 line XSLT transformation to extract the grammar from the XML REC) and it worked the first time! So I think it's reasonably complete, though it has a few details that are hard-coded to SPARQL.

See also: cwm-talk discussion, swig chump entry.