connolly's blog

Fun with Embedded RDF and DOAP for the GRDDL profile

Submitted by connolly on Sat, 2006-02-25 01:13. ::

I just updated the GRDDL profile page to document the easy way to use GRDDL before the hard way.

This past week I swapped in all sorts of travel details and redid the travel schedule on my homepage as a result; I switched to hCalendar rather than my own home-grown schedule dialect while I was at it, and replaced my FOAF dialect with Embedded RDF.

Since that went so well, and I don't care to go mucking with XHTML at the DTD level, I took the RDDL and XLink markup out of the GRDDL profile page and replaced it with embedded RDF.

In order to document the domains and ranges of the GRDDL properties, each property's id attribute had to be on an element that dominates those links, i.e. on the dd element rather than the previous tt element. That clashes with the way XMDP uses id attributes, so I'm not using XMDP markup either. I'm just using Embedded RDF for the RDF Schema bits.

I figured the link to the mailing list archive deserved to be a 1st-class link, so I used DOAP. The DOAP schema is quite tabulator-happy. Good work there.
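For the record, the DOAP boils down to something like this Turtle sketch (not the actual markup on the profile page; the project name and the mailing-list URI here are placeholders):

@prefix doap: <http://usefulinc.com/ns/doap#>.

[] a doap:Project;
    doap:name "GRDDL";                                              # placeholder label
    doap:homepage <http://www.w3.org/2003/g/data-view>;             # the profile page itself
    doap:mailing-list <http://lists.example.org/grddl-archive/>.    # placeholder archive URI

Since doap:mailing-list is an ordinary first-class property in the DOAP schema, the archive link shows up in the tabulator like any other arc.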

Investigating logical reflection, constructive proof, and explicit provability

Submitted by connolly on Thu, 2006-02-16 02:38. ::

Prompted by a question about RDF schemas for dollars, gallons, and other units, I found myself looking into SUMO again.

The SUMO/wordnet mappings are great, and the SUMO time concepts are backed by a nice KIF formalization of Allen's work, but it's overly constrained; I prefer the cyc style where the before and after relations apply to conferences and events, without indirection via time intervals.

But what got me really thinking was the way holdsDuring takes formulas as arguments, despite the claim that SUMO is written in SUO-KIF, which has pretty traditional first-order syntax and semantics. I eventually found this addressed explicitly:

... SUMO does include a number of predicates that take formulae as arguments, such as holdsDuring, believes, and KappaFn. In Sigma, we also perform a number of "tricks" which allow the user to state things which appears to be higher order, but which are in fact first order and have a simple syntactic transformation to standard first order form.

How does SUMO employ higher order logic? in the Ontology Portal FAQ, March 2005

I'm curious about how variables are handled in these tricks. The code is all there, and I downloaded it, but I didn't get very far.

I actually don't think first-order semantics are the best fit for the Semantic Web, where formulas refer to documents and the formulas they contain; reflection is a key mechanism, and prohibiting loops is unscalable. I think constructive proof is a much better way to think about proofs in N3.

I discovered Artemov's explicit provability stuff a while ago; my notes date from September 2003:

... explicit provability provides a tool of increasing the efficiency of a formal system by verified rules without explosion of the metamathematical level of the system.

So I dug out my LogicOfProofs larch transcription notes and the Explicit provability and constructive semantics article from the Bulletin of Symbolic Logic, volume 7, No.1, pp. 1-36, 2001 and started working on lp.n3, and an N3 rule for the constructive form of modus ponens:

# application
# (t, s, ts, F, G are variables; lp.n3 presumably declares them universal, e.g. with @forAll)
{ (t s) comp ts.
t proves { F => G }.
s proves F
}
=> { ts proves G }.

that is: if t is an algorithm for proving that F implies G, and s is an algorithm for proving F, then the composition of t and s is an algorithm for proving G. This Curry-Howard correspondence is really nifty.

The proof that "Socrates is mortal" from "if Socrates is a man then he is mortal" and "Socrates is a man" looks like:

2000/10/swap$ python cwm.py util/lp.n3 --think
...
( lpex:a2
  lpex:a1 )
    a l:_g0;
    :comp [
        a :Proof;
        :proves { lpex:Socrates a lpex:Mortal .
        } ] .

lpex:a1 a :Proof;
    :proves { lpex:Socrates a lpex:Man .
    } .

lpex:a2 a :Proof;
    :proves { { lpex:Socrates a lpex:Man .
        } log:implies { lpex:Socrates a lpex:Mortal .
        } .
    } .

... which is much easier to read than cwm's --why style:

2000/10/swap$ python cwm.py test/reason/socrates.n3 --think --why
@forSome :_g0 .
[ a pr:Conjunction,
      pr:Proof;
  pr:component [
      a pr:Inference;
      pr:evidence (
          [ a pr:Extraction;
            pr:because :_g0;
            pr:gives { :socrates a :Man .
            } ] );
      pr:rule [
          a pr:Extraction;
          pr:because :_g0;
          pr:gives { { :socrates a :Man .
              } log:implies { :socrates a :Mortal .
              } .
          } ] ],
      :_g0;
  pr:gives { :socrates a :Man,
          :Mortal .
      { :socrates a :Man .
      } log:implies { :socrates a :Mortal .
      } .
  } ].

I didn't actually see formulas occurring as terms in that 2001 paper. So it might be a distraction with respect to the original holdsDuring issue. And that version of the logic of proofs includes all of propositional calculus, including the law of the excluded middle. But among his accomplishments I see

Reflexive lambda-calculus. The Curry-Howard isomorphism converting intuitionistic proofs into typed lambda-terms is a simple instance of an internalization property of our system lambda-infinity which unifies intuitionistic propositions (types) with lambda-calculus and which is capable of internalizing its own derivations as lambda-terms.

so perhaps I should keep studying his stuff. I wish he'd use s-expressions and QUOTE like Moore's ACL2 paper and Chaitin's work rather than doing reflection with Gödel numbering. I wonder what HA is; ah... wikipedia to the rescue:

It is essentially Peano arithmetic, minus the law of the excluded middle...

Toward integration of cwm's proof structures with InferenceWeb's PML

Submitted by connolly on Wed, 2006-02-15 23:11. ::

The proof generation feature of cwm has been in development for a long time. The idea goes back at least as far as the section on proofs and validation in the original 1998 Semantic Web Roadmap, where we find seeds of the proof-based access control idea that is now funded under the policy aware web project:

The HTTP "GET" will contain a proof that the client has a right to the response.

And in the TAMI project, we're looking at proofs as part of an audit mechanism for accountable datamining.

The cwm development process incorporates aspects of extreme programming: test-driven development and a variation on pair programming; when somebody has a new feature working, somebody else in the group tries to (a) reproduce the test results and (b) review the tests, if not the code. When the pair agree that the tests are OK, we claim victory in a small group setting, and if that goes well, we make a release or at least send mail about the new feature. This typically takes a week or two or three.

In the case of cwm --why, I have been evaluating the tests since at least as far back as this December 2002 check-in comment on swap/test/reason/Makefile, and I still haven't made up my mind:

date: 2002/12/30 15:00:35; author: timbl; state: Exp; lines: +9 -6
--why works up to reason/t5. GK and SBP's list bugs fixed.

Tim has explained the simplicity of the reason proof ontology to me many times, and I have tried and failed to understand. I think I'm finally starting to get it. I'm nowhere near being certain it's free of use/mention problems, but I'm starting to see how it works.

The InferenceWeb folks have all sorts of nifty proof browsing stuff, and they're working with us in the TAMI project. In our meeting last August, they explained PML well enough that TimBL started on to-pml.n3, some rules to convert cwm's proof structures to PML. The rest of the integration task has been starved by work on SPARQL and Paulo moving to UTEP and all sorts of other stuff, but we seem to be moving again.

I tried running one of my versioning proofs through to-pml.n3 and then looking at it with the InferenceWeb proof browser, but I haven't got the PML structure right and it doesn't give very detailed diagnostics.

I made some progress by loading one of the PML examples, along with my swap/reason style proof, into tabulator (alpha) and using the outline view to browse the structure. (It turns out that TimBL started coding the tabulator one day when he was having trouble reading a proof. GMTA ;) I discovered that PML is striped:

  • NodeSet
    • isConsequentOf
      • InferenceStep
        • hasAntecedent
          • NodeSet
            • ...

... where the swap/reason ontology just has Step objects and hangs the conclusions off them.
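Side by side in Turtle, the contrast looks roughly like this (a sketch only; the pml: prefix is made up, and the pr: terms are the ones from the socrates proof above):

# PML is striped: NodeSets and InferenceSteps alternate
[] a pml:NodeSet;
    pml:isConsequentOf [
        a pml:InferenceStep;
        pml:hasAntecedent [ a pml:NodeSet ]    # ... and so on, recursively
    ].

# swap/reason: a step carries its conclusion directly
[] a pr:Inference;
    pr:gives { :socrates a :Mortal };
    pr:evidence ( [ a pr:Extraction; pr:gives { :socrates a :Man } ] ).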

That was the key to some big changes to to-pml.n3. I don't have the output working in the PML browser yet, but Paulo sent me a pointer to a PML primer, which seems to have the remaining clues that I need.

See also: help checking/reading some proof related to versioning? to cwm-talk.

Using RDF and OWL to model language evolution

Submitted by connolly on Wed, 2006-02-15 18:37. ::

[diagram on whiteboard] Back in September, the TAG had this great whiteboard discussion of versioning.

[versioning diagram] TimBL captured an omnigraffle version, which is nice because it's XML and I can convert it to RDF with XSLT, but it's a mac thing, and Dave Orchard, who's doing most of the relevant writing, was more inclined to build UML diagrams with Visio. Then Henry Thompson found this violet tool that's just what we need: a light-weight, cross-platform (Java, in this case) editor that produces pretty clean XML representations of UML class diagrams.

This week, Dave sent an update on the terminology section.

That was just the prompt I was looking for to swap in my work on formally defining W3C's namespace change policy options from last September. I thought for a while about how to test that the RDF version of the UML diagram was right, and I realized I could get cwm to smush producers and consumers based on cardinality constraints.

First we extract RDF from the UML diagram...

ext-vers$ make test-agent-pf.n3
xsltproc ... --output ext-vers-uml.rdf grokVioletUML.xsl ext-vers-uml.violet

... and then we mix with some rules to implement OWL cardinalities...

python $swap/cwm.py ext-vers-uml.rdf test-agent.n3 \
owl-excerpt.n3 rdfs-excerpt.n3 --think \
--filter=test-agent-goal.n3 --why >test-agent-pf.n3

And indeed, it concludes a1 = a2.
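The smushing comes from rules of roughly this shape (my paraphrase of the idea, not necessarily how owl-excerpt.n3 spells it):

# a max-cardinality-1 restriction means two values for the same
# subject must denote the same thing
{ ?R a owl:Restriction; owl:onProperty ?p; owl:maxCardinality 1.
  ?s a ?R; ?p ?x; ?p ?y.
} => { ?x owl:sameAs ?y }.

That sort of rule, applied to the UML-derived instance data, is what identifies a1 with a2.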

I'm working on getting a proof/justification using --why and some proof browsing tools, but that's another story.

One of the reasons I'm pleased with this ontology that the TAG came up with in Edinburgh is that it allows me to formalize the concept of sublanguage that I have heard TimBL talk about now and again. For example:

  1. HTML2 is a sublanguage of HTML4.
  2. a production of chap2 is in HTML2; the intent is chap1in
  3. a consumption of chap2, ?C is in HTML4
  4. Show: ?C intent chap1in
  5. ?C intent chap1in by 1, 2, 3 and defn sublanguage
where the definition of sublanguage is:
{
[] ev:text ?TXT; :language [ is :sublanguage of ?BIG ]; :intent ?I.
?COMM ev:text ?TXT; :language ?BIG.
} => { ?COMM :intent ?I }.
{
[] ev:text ?TXT; :language ?BIG; :impact ?I.
?COMM ev:text ?TXT; :language [ is :sublanguage of ?BIG ].
} => { ?COMM :impact ?I }.
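Spelled out with some made-up instance data (the ex: names are placeholders), the HTML2/HTML4 example runs like this:

ex:HTML2 is :sublanguage of ex:HTML4.    # 1. HTML2 is a sublanguage of HTML4

[] ev:text "...chap2...";  :language ex:HTML2; :intent ex:chap1in.    # 2. a production of chap2
ex:C ev:text "...chap2..."; :language ex:HTML4.                       # 3. a consumption of chap2, ?C

# the first rule above then concludes (4, 5):
#   ex:C :intent ex:chap1in .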

formally closing the feedback loop

Submitted by connolly on Fri, 2006-02-10 01:39. ::

Speaking of checking specs against tests, I suppose the first time I did that was when I was editing HTML 2. I maintained a collection of tests that I checked against the DTD whenever I changed it.

When we couldn't get other collaborators to install an SGML parser and do likewise, Mark Gaither and I announced the zero-install validation service, a precursor to the W3C markup validation service.

I'm on an Adventures in Formal Methods panel at the W3C tech plenary day. I expect it will be useful to tell that story to make the point that formalizing knowledge allows us to delegate tasks to the machine.

tags pending: web history? quality?

bnf2turtle -- write a turtle version of an EBNF grammar

Submitted by connolly on Fri, 2006-02-10 01:11. ::

In order to cross one of the few remaining t's on the SPARQL spec, I wrote bnf2turtle.py today. It turned out to be such a nice piece of code that I elaborated the module documentation using ReStructuredText. It's checked into the SPARQL spec editor's draft materials, but I'll probably move it to the swap codebase presently. Meanwhile, here's the formatted version of the documentation:

Author: Dan Connolly
Version: $Revision: 1.13 $ of 2006-02-10
Copyright: W3C Open Source License. Share and enjoy.

Usage

Invoke a la:

python bnf2turtle.py foo.bnf >foo.ttl

where foo.bnf is full of lines like:

[1] document ::= prolog element Misc*

as per the XML formal grammar notation. The output is Turtle - Terse RDF Triple Language:

:document rdfs:label "document"; rdf:value "1";
 rdfs:comment "[1] document ::= prolog element Misc*";
 a g:NonTerminal;
  g:seq (
    :prolog
    :element
    [ g:star
      :Misc
     ]
   )
.

Motivation

Many specifications include grammars that look formal but are not actually checked, by machine, against test data sets. Debugging the grammar in the XML specification has been a long, tedious manual process. Only when the loop is closed between a fully formal grammar and a large test data set can we be confident that we have an accurate specification of a language [1].

The grammar in the N3 design note has evolved based on the original manual transcription into a python recursive-descent parser and subsequent development of test cases. Rather than maintain the grammar and the parser independently, our goal is to formalize the language syntax sufficiently to replace the manual implementation with one derived mechanically from the specification.

[1] and even then, only the syntax of the language.

Related Work

Sean Palmer's n3p announcement demonstrated the feasibility of the approach, though that work did not cover some aspects of N3.

In development of the SPARQL specification, Eric Prud'hommeaux developed Yacker, which converts EBNF syntax to perl and C and C++ yacc grammars. It includes an interactive facility for checking strings against the resulting grammars. Yosi Scharf used it in cwm Release 1.1.0rc1, which includes a SPARQL parser that is almost completely mechanically generated.

The N3/turtle output from yacker is lower level than the EBNF notation from the XML specification; it has the ?, +, and * operators compiled down to pure context-free rules, obscuring the grammar structure. Since that transformation is straightforwardly expressed in semantic web rules (see bnf-rules.n3), it seems best to keep the RDF expression of the grammar in terms of the higher level EBNF constructs.
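For instance, the star operator compiles away with a rule along these lines (my sketch of the idea, not necessarily what bnf-rules.n3 actually says; g:empty is invented here for illustration):

# X* is either empty, or one X followed by X* again
{ ?r g:star ?x } => { ?r g:alt ( g:empty [ g:seq ( ?x ?r ) ] ) }.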

Open Issues and Future Work

The yacker output also has the terminals compiled to elaborate regular expressions. The best strategy for dealing with lexical tokens is not yet clear. Many tokens in SPARQL are case insensitive; this is not yet captured formally.

The schema for the EBNF vocabulary used here (g:seq, g:alt, ...) is not yet published; it should be aligned with swap/grammar/bnf and the bnf2html.n3 rules (and/or the style of linked XHTML grammar in the SPARQL and XML specifications).

It would be interesting to corroborate the claim in the SPARQL spec that the grammar is LL(1) with a mechanical proof based on N3 rules.

Background

The N3 Primer by Tim Berners-Lee introduces RDF and the Semantic web using N3, a teaching and scribbling language. Turtle is a subset of N3 that maps directly to (and from) the standard XML syntax for RDF.

I started with a kludged and broken algorithm for handling the precedence of | vs concatenation in EBNF rules; for a moment I thought the task required a yacc-like LR parser, but then I realized recursive descent would work well enough. A dozen or so doctests later, it did indeed work. I haven't checked the resulting grammar against the SPARQL tests yet, but it sure looks right.

Then I wondered how much of the formal grammar notation from the XML spec I hadn't covered, so I tried it out on the XML grammar (after writing a 20 line XSLT transformation to extract the grammar from the XML REC) and it worked the first time! So I think it's reasonably complete, though it has a few details that are hard-coded to SPARQL.

See also: cwm-talk discussion, swig chump entry.

tabulator use cases: when can we meet? and PathCross

Submitted by connolly on Wed, 2006-02-08 13:48. ::

I keep all sorts of calendar info in the web, as do my colleagues and the groups and organizations we participate in.

Suppose it was all in RDF, either directly as RDF/XML or indirectly via GRDDL as hCalendar or the like.

Wouldn't it be cool to grab a bunch of sources, and then tabulate names vs. availability on various dates?

I would probably need rules in the tabulator; Jos's work sure seems promising.

Closely related is the PathCross use case...

Suppose I'm travelling to Boston and San Francisco in the next couple months. I'd like my machine to let me know I have a FriendOfaFriend who also lives there or plans to be there.
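A first cut at a PathCross rule might look something like this; foaf:knows and foaf:based_near are real FOAF properties, but the trip vocabulary (:plannedTrip, :destination, :suggestMeeting) is made up for the sketch, and overlapping dates are ignored:

{ ?ME foaf:knows ?WHO.
  ?ME :plannedTrip [ :destination ?CITY ].
  ?WHO :plannedTrip [ :destination ?CITY ].
} => { ?ME :suggestMeeting ?WHO }.

# likewise for friends who just live there
{ ?ME foaf:knows ?WHO; :plannedTrip [ :destination ?CITY ].
  ?WHO foaf:based_near ?CITY.
} => { ?ME :suggestMeeting ?WHO }.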

See also the Open Group's Federated Free/Busy Challenge.

python, javascript, and PHP, oh my!

Submitted by connolly on Tue, 2006-02-07 11:28. ::

My habits for developing quality code in python are bumping up against the fact that the deployment platforms for the web client and server are javascript and PHP, respectively.

I love the python doctest module. uripath.py is a pretty good example of how I like to use it to simultaneously document and test code:

def splitFrag(uriref):
    """split a URI reference between the fragment and the rest.

    Punctuation is thrown away.

    e.g.
    
    >>> splitFrag("abc#def")
    ('abc', 'def')
[...]

def splitFragP(uriref, punct=0):
    """split a URI reference before the fragment

    Punctuation is kept.
    
    e.g.

    >>> splitFragP("abc#def")
    ('abc', '#def')
[...]

Another important way that python is self-documenting is that it meets the unambiguity requirement: you can pick up any .py file and trace every identifier back to what it refers to by following your nose:

  • for local variables, normal static scoping rules work; just scan up and look for an assignment or a function parameter
  • for imported names, find the relevant import statement. from foo import * is evil, of course.
  • global variables are explicitly declared as such

OK, full disclosure: you need to know the python built-ins, and when you see paramx.methody(z), you have an unbounded search for methody, which makes doctests that show what class paramx comes from pretty important. Mapping from the relevant import statement to the corresponding .py file may involve the usual search path nightmares; python doesn't solve that. redfoot's red_import is interesting. And I'm not sure if eggs are a step in the right direction or the wrong direction; gotta study them more. I try to ground import statements in the web a la:

import MySQLdb # MySQL for Python
               # http://sourceforge.net/projects/mysql-python
               # any Python Database API-happy implementation will do.

... so that you can follow your nose from the ImportError traceback to resolve the dependency.

Now timbl has started migrating the swap/cwm stuff to javascript. Let's look at uri.js:

/** return the protocol of a uri **/
function uri_protocol(uri) {
...

Thanks for trying to document each function, but that sort of comment isn't worthwhile; the risk that it'll get out of sync with the code is greater than the information it provides. Back to naming...

function URIjoin(given, base) {
...
	if (base<0) {alert("Invalid base URL "+ base); return given}

Where does alert come from? Is that in the ECMAScript standard? Or in some Netscape devedge stuff?

But more importantly: why not raise an exception? Javascript does have throw/catch, no? Is it not the norm to use them? As I argued in my contribution to the Python, Tcl and Perl, oh my! thread in 1996, the community norms are at least as important as the language features, if not more so, when it comes to developing quality code.

I keep running into javascript and PHP code that I want to read and wishing for doctests because I can't figure out which end is up.

Whence comes kb in this bit from term.js? Do I face an unbounded search?

RDFFormula.prototype.fromNT = function(str) {
	var len = str.length
	var x
	if (str[0] == '<') return kb.sym(str.slice(1,len-1))

Maybe I just need to study the standard libraries a bit more, but I hear that the drupal project has coding conventions lots of people like for developing quality PHP code; I hope to study those. And the PEAR community must have some norms and best practices. I went looking for javascript testing stuff and ran into JSAN, a CPAN work-alike. That sort of infrastructure naturally reinforces quality norms.


RSS is dead; long live RSS

Submitted by connolly on Thu, 2006-02-02 23:36. ::

Just after I saw the news that Norm is killing his RSS feed in favor of Atom, my city's newsletter touts that they're starting to use RSS.

They don't seem to keep back issues of the newsletter online, and their markup is invalid. Maybe they'd better study some of the older technologies like URIs and HTML before moving up to RSS.

Escaped markup really is nasty. Is there a way to do it with rdf:parseType="Literal" that works? Why isn't this answer readily available from a test suite? There are validation services for RSS and Atom, and I'm sure various tools have various regression suites, but I don't see much in the way of collaborative development of test suites for feeds.
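For what it's worth, the difference shows up clearly once the feed is parsed into N3 (a sketch; :item and :description are placeholders for whatever the feed vocabulary uses):

# escaped markup: the tags are just characters in a plain string
:item :description "<p>City <em>news</em> here</p>".

# rdf:parseType="Literal" in the RDF/XML gives an XML literal instead
:item :description "<p>City <em>news</em> here</p>"^^rdf:XMLLiteral.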

I've done a little bit of work on hCard testing; I try to stay test-driven as I write GRDDL transformations for microformats. I'd like to find more time for that sort of thing.

Using truth maintenance techniques in RDF stores?

Submitted by connolly on Mon, 2006-01-23 11:23. ::

Almost every RDF store actually stores more than triples: "quads" or "contexts" or provenance or something...

The cwm formula interface used to use (f, p, s, o), where f was the formula in which the triple occurred; then it grew to (f, p, s, o, why), where why was a proof structure. Another common pattern is (s, p, o, source), where source points to a file/resource that the triple was loaded from. In re-writing an RDF store for tabulator, TimBL seems to be settling on calling that 4th column why, which works like source in common cases. This is inspired at least in part by some truth maintenance stuff that Chris presented Oct 6 2005 in the Notions and Notations class.
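In N3 you can already say that sort of thing directly, since formulas are first-class; a sketch, with :why standing in for whatever the store ends up calling that column and example.org as a placeholder:

{ :socrates a :Man } :why <http://example.org/socrates.n3>.    # loaded from a source document

{ :socrates a :Mortal } :why [ a :InferenceStep ].             # derived; the bnode would point at proof structure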

Meanwhile, Edd observes lots of submissions about RDF Stores for XTech 2006. I hope that folks building stores look at TMS architectures. At least one group is:

Word has it the classic papers are:

See also: WebDataInterfaceDesign.
