SPARQL
Linked Data at WWW2007: GRDDL, SPARQL, and Wikipedia, oh my!
Last Tuesday, TimBL started to gripe that the WWW2007 program had lots of stuff that he wanted to see all at the same time; we both realized pretty soon: that's a sign of a great conference.
That afternoon, Harry Halpin and I gave a GRDDL tutorial. Deploying Web-scale Mash-ups by Linking Microformats and the Semantic Web is the title Harry came up with... I was hesitant to be that sensationalist when we first started putting it together, but I think it actually lived up to the billing. It's too bad last-minute complications prevented Murray Maloney from being there to enjoy it with us.
For one thing, GRDDL implementations are springing up all over. I donated my list to the community as the GrddlImplementations wiki topic, and when I came back after the GRDDL spec went to Candidate Recommendation on May 2, several more had sprung up.
What's exciting about these new implementations is that they go beyond the basic "here's some RDF data from one web page" mechanism. They're integrated with RDF map/timeline browsers, and SPARQL engines, and so on.
The example from the GRDDL section of the semantic web client library docs (by Chris Bizer, Tobias Gauß, and Richard Cyganiak) is just "tell me about events on Dan's travel schedule" but that's just the tip of the iceberg: they have implemented the whole LinkedData algorithm (see the SWUI06 paper for details).
With all this great new stuff popping up all over, I felt I should include it in our tutorial materials. I'm not sure how long OpenLink Virtuoso has had GRDDL support (along with database integration, WEBDAV, RSS, Bugzilla support, and on and on), but it was news to me. But I also had to work through some bugs in the details of the GRDDL primer examples with Harry (not to mention dealing with some unexpected input on the HTML 5 decision). So the preparation involved some late nights...
I totally forgot to include the fact that Chime got the Semantic Technologies conference web site using microformats+GRDDL, and Edd did likewise with XTech.
But the questions from the audience showed they were really following along. I was a little worried when they didn't ask any questions about the recursive part of GRDDL; when I prompted them, they said they got it. I guess verbal explanations work; I'm still struggling to find an effective way to explain it in the spec. Harry followed up with some people in the halls about the spreadsheet example; as mnot said, Excel spreadsheets contain the bulk of the data in the enterprise.
One person was even following along closely enough to help me realize that the slide on monotonicity/partial understanding uses a really bad example.
The official LinkedData session was on Friday, but it spilled over to a few impromptu gatherings; on Wednesday evening, TimBL was browsing around with the tabulator, and he asked for some URIs from the audience, and in no time, we were browsing proteins and diseases, thanks to somebody who had re-packaged some LSID-based stuff as HTTP+RDF linked data.
Giovanni Tummarello showed a pretty cool back-link service for the Semantic Web. It included support for finding SPARQL endpoints relevant to various properties and classes, a contribution to the serviceDescription issue that the RDF Data Access Working Group postponed. I think I've seen a few other related ideas here and there; I'll try to put them in the ServiceDescription wiki topic when I remember the details...
Chris Bizer showed that dbpedia is the catalyst for an impressive federation of linked data. Back in March 2006, Toward Semantic Web data from Wikipedia was my wish into the web, and it's now granted. All those wikipedia infoboxes are now out there for SPARQLing. And other groups are hooking up musicbrainz and wordnet and so on. After such a long wait, it seems to be happening so fast!
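For anyone wondering what "SPARQLing" looks like on the wire, here is a minimal sketch of building a SPARQL protocol request as an HTTP GET URL. The endpoint address, the property URI, and the query itself are made-up placeholders, not taken from the dbpedia documentation; only the `query` parameter name comes from the SPARQL protocol.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; substitute the address of a real service.
ENDPOINT = "http://sparql.example/sparql"

# Illustrative query; the property URI is a placeholder.
QUERY = """
SELECT ?city ?population
WHERE { ?city <http://example.org/property/population> ?population }
LIMIT 10
"""

def sparql_get_url(endpoint, query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = sparql_get_url(ENDPOINT, QUERY)

# Round-trip check: the server side would decode the same query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) would return the result set in whatever format the service offers.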
Speaking of fast, the Semantic MediaWiki project itself is starting to do performance testing with a full copy of wikipedia, Denny told us on Friday afternoon in the DevTrack.
Also speaking of fast, how did OpenLink go from not-on-my-radar to supporting every Semantic Web Technology I have ever heard of in about a year? I got part of the story in the halls... it started with ODBC drivers about a decade ago, which explains why their database integration is so good. Kingsley, here's hoping we get to play volleyball sometime. It's a shame we had just a few short moments together in the halls...
Stitching the Semantic Web together with OWL at AAAI-06
I was pleased to find that AAAI '06 in Boston a couple weeks ago had a spectrum of people I know and don't know and work that's near and far from my own. The talk about the DARPA grand challenge was inspiring.
But closer to my work, I ran into Jeff Heflin, who I worked with on DAML and especially the OWL requirements document. Amid too many papers about ontologies for the sake of ontologies and threads like Is there real world RDF-S/OWL instance data?, his Investigation into the Feasibility of the Semantic Web is a breath of fresh air. The introduction sets out their approach this way:
Our approach is to use axioms of OWL, the de facto Semantic Web language, to describe a map for a set of ontologies. The axioms will relate concepts from one ontology to the other. ... There is a well-established body of research in the area of automated ontology alignment. This is not our focus. Instead we investigate the application of these alignments to provide an integrated view of the Semantic Web data.
(emphasis mine). The rest of the paper justifies this approach, leading up to:
We first query the knowledge base from the perspective of each of the 10 ontologies that define the concept Person. We now ask for all the instances of the concept Person. The results vary from 4 to 1,163,628. We then map the Person concept from all the ontologies to the Person concept defined in the FOAF ontology. We now issue the same query from the perspective of this map and we get 1,213,246 results. The results now encompass all the data sources that commit to these 10 ontologies. Note: a pair wise mapping would have taken 45 mapping axioms to establish this alignment instead of the 9 mapping axioms that we used. More importantly due to this network effect of the maps, by contributing just a single map, one will
automatically get the benefit of all the data that is available in the network.
That's fantastic stuff.
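The 45-versus-9 arithmetic in that quote is worth spelling out. A small sketch, assuming ten ontologies where one of them (FOAF in the paper) acts as the hub; the function names are mine:

```python
from itertools import combinations

def pairwise_maps(n):
    """Number of maps needed if every pair of ontologies is aligned directly."""
    return len(list(combinations(range(n), 2)))

def hub_maps(n):
    """Number of maps needed if every other ontology is mapped to one hub."""
    return n - 1

print(pairwise_maps(10), hub_maps(10))  # prints: 45 9
```

The pairwise count grows quadratically, n(n-1)/2, while the hub count grows linearly, which is the network effect the authors point at.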
We now pause for a word from Steve Lawrence, NEC Research Institute, to lament the lack of free online proceedings for AAAI: Articles freely available online are more highly cited. For greater impact and faster scientific progress, authors and publishers should aim to make research easy to access.
OK, now back to the great paper...
Along the way, they give a definition of a knowledge function, K, that is remarkably similar to log:semantics from N3. They also define a commitment function that is basically the ontological closure pattern.
The approach to querying all this data is something they call DLDB, which comes from a paper they submitted to the ISWC Practical and Scalable Semantic Systems workshop. Darn! no full text proceedings online again. Ah... Jeff's pubs include a tech report version. To paraphrase: there's a table for each class and a table for each property that relates rows from the class tables. They use a DL reasoner to find subclass relationships, and they make views out of them. I have never seen this approach to RdfAndSql before; it sure looks promising. I wonder if we can integrate it into our dbview work somehow and perhaps into our truth-maintenance system in the TAMI project.
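To make my paraphrase concrete, here is a rough sketch of the table-per-class, table-per-property, view-per-superclass layout using sqlite3. It is my own reconstruction from the description above, not DLDB's actual schema; the class and property names are invented, and the subclass fact is simply assumed to have come from a DL reasoner.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- one table per class
CREATE TABLE Person  (id TEXT PRIMARY KEY);
CREATE TABLE Student (id TEXT PRIMARY KEY);
-- one table per property, relating rows of the class tables
CREATE TABLE advisor (subject TEXT, object TEXT);
-- assumption: a reasoner reported that Student is a subclass of Person,
-- so the view for Person also includes the Student rows
CREATE VIEW Person_view AS
    SELECT id FROM Person UNION SELECT id FROM Student;
""")
db.execute("INSERT INTO Person VALUES ('ex:alice')")
db.execute("INSERT INTO Student VALUES ('ex:bob')")
db.execute("INSERT INTO advisor VALUES ('ex:bob', 'ex:alice')")

# Querying the view answers "all instances of Person" with inference applied.
people = sorted(row[0] for row in db.execute("SELECT id FROM Person_view"))
print(people)  # ['ex:alice', 'ex:bob']
```

The appeal is that the inference cost is paid once, when the views are defined, and ordinary SQL does the rest.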
This wasn't the only work at AAAI on scalable, practical knowledge representation. I caught just a glance at some other papers at the conference that exploit wikipedia as a dataset in various algorithms. I hope to study those more.
I also ran into Ben Kuipers, whose Algernon and Access-Limited Logic has long appealed to me as an approach to reasoning that might work well when scaled up to Semantic Web data sets. That work is mostly on hold; we started talking about getting it going again, but didn't get very far into the conversation. I hope to pick that up again soon.
I gather the 1.0 release of OpenCyc happened at the conference; there's a lot of great stuff in cyc, but only time will tell how well it will integrate with other Semantic Web stuff.
Meanwhile, a handy citation for Heflin's paper...
- An Investigation into the Feasibility of the Semantic Web. In Proc. of the Twenty First National Conference on Artificial Intelligence (AAAI 2006), Boston, USA, 2006 (abstract)
That's marked up using an XHTML/LaTeX/BibTeX idiom that I'm working on so that we get BibTeX for free:
@inproceedings{pan06a,
title = "{An Investigation into the Feasibility of the Semantic Web}",
author = {Z. Pan and A. Qasem and J. Heflin},
booktitle = {Proc. of the Twenty First National Conference on Artificial Intelligence (AAAI 2006)},
year = {2006},
address = {Boston, USA},
}
Exporting databases in the Semantic Web with SPARQL, D2R, dbview, ARC, and such
The developer track at WWW2006 last week in Edinburgh was really cool; you had to show up on time or you couldn't fit in the room! One of the coolest talks was D2R-Server - Publishing Relational Databases on the Web as SPARQL-Endpoints. I see D2R Server is released now. Cool.
Yes, storing RDF in a SQL database using 3-column tables (or 4 or 5 or 6...) is cool as far as it goes, but I'm glad we're finally seeing more work on taking existing SQL databases (whose schemas are not designed with RDF in mind) and exporting them as RDF.
TimBL wrote a design note on Relational Databases on the Semantic Web in 1998. In 2002, I wrote dbview.py, a couple hundred lines of python that implements parts of it. Rob Crowell picked it up and the 2005/2006 version of dbview.py now does foreign keys and backlinks.
D2R gets points for using RDF for their configuration/mapping info. The slides showed turtle/n3. Why are the dbin brainlets in XML but not RDF? I wonder.
D2R Server has a mapping layer; dbview assumes that will be handled with rules. The choice of URIs for column names is interesting. D2R uses jdbc:mysql://127.0.0.1/wordpress#users1, but dbview is all about embedding a SQL database in HTTP space, so we use URIs like http://db.example/orders/customers/custno/1#item. In dbview, the decisions about when to use / and when to use # are made so that the result is browseable. In D2R, the default URIs don't matter as much because it's expected that they'll be mapped to a more well-known ontology/schema like foaf.
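Here is a minimal sketch of the dbview-style idea: walk an existing SQL table and emit one triple per non-null cell, with row URIs shaped like the example above. It is not the actual dbview.py code; the property URI scheme in `column_uri` is an assumption of mine, as is the sample table.

```python
import sqlite3
from urllib.parse import quote

BASE = "http://db.example/orders"  # database root, as in the example URI above

def row_uri(table, key_col, key):
    """URI for the thing a row describes: .../table/keycol/key#item"""
    return "%s/%s/%s/%s#item" % (BASE, quote(table), quote(key_col), quote(str(key)))

def column_uri(table, col):
    """Hypothetical scheme for property URIs: .../table#column"""
    return "%s/%s#%s" % (BASE, quote(table), quote(col))

def export_rows(db, table, key_col):
    """Yield (subject, property, value) for each non-null cell.
    The table name is trusted here; do not pass user input."""
    cur = db.execute("SELECT * FROM %s" % table)
    cols = [d[0] for d in cur.description]
    for row in cur:
        rec = dict(zip(cols, row))
        subj = row_uri(table, key_col, rec[key_col])
        for col, val in rec.items():
            if val is not None:
                yield (subj, column_uri(table, col), val)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (custno INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Acme')")
triples = list(export_rows(db, "customers", "custno"))
for t in triples:
    print(t)
```

Foreign keys would turn into links between row URIs rather than literal values, which is what makes the result browseable.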
dbview is still just a few hundred lines of python; we haven't integrated the SPARQL parser that Yosi developed for cwm, nor integrated EricP's work on federated query.
Speaking of federated query... on Wednesday at the conference, I saw Tim Finin in the poster session. He showed me something the swoogle folks are cooking up: you give it a SPARQL query, and it looks at the terms used in your query and suggests documents you should put in your SPARQL dataset to run your query against. I hope to hear more about that.
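I don't know how the swoogle folks do it, but the first step is presumably something like the sketch below: pull the URIs out of the query text and group them by namespace, so that each namespace can be looked up in an index of documents. The regular expression is a deliberate simplification (it ignores prefixed names and can be fooled by comparison operators written without spaces), and the function names are mine.

```python
import re

# Full IRIs in angle brackets; no whitespace allowed inside.
IRI = re.compile(r"<([^<>\s]+)>")

def terms_used(query):
    """Return the distinct IRIs mentioned in a SPARQL query string."""
    return sorted(set(IRI.findall(query)))

def namespaces(terms):
    """Cut each term after its last '#' or '/' to guess its vocabulary."""
    out = set()
    for term in terms:
        cut = max(term.rfind("#"), term.rfind("/"))
        out.add(term[:cut + 1])
    return sorted(out)

query = """
SELECT ?name WHERE {
  ?p <http://xmlns.com/foaf/0.1/name> ?name ;
     <http://xmlns.com/foaf/0.1/knows> ?q .
}
"""
terms = terms_used(query)
print(terms)
print(namespaces(terms))  # ['http://xmlns.com/foaf/0.1/']
```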
Somewhere in EricP's work is one of the several SPARQL-to-SQL rewriters out there... oh... I thought the HP tech report, A relational algebra for SPARQL was another one, but it seems to be by Richard Cyganiak, one of the D2R guys.
Benjamin Nowack's Feb 2006 item announced a SPARQL-to-SQL rewriter for his ARC RDF store for PHP.
Hmm... maybe it's time for a ScheduledTopicChat on SPARQL, SQL, and RDF? If you're interested, suggest a couple times that would be good for you in a comment or in mail to me and a public archive.
On GData, SPARQL update, and RDF Diff/Sync
The Google Data APIs Protocol is pretty interesting. It seems to be based on the Atom publishing protocol, which is a pretty straightforward application of HTTP and XML, so that's a good thing.
The query features seem to be less expressive than the SPARQL protocol, but GData has an update feature, while the SPARQL update issue is postponed. Updating at the triple level is tricky. I helped TimBL refine Delta: an ontology for the distribution of differences between RDF graphs a bit, and there's working code in cwm. But I haven't really managed to use it in practical settings. My PDA's calendar has an XMLRPC service where I can update a whole record at a time, just like GData. I assume caldav does likewise.
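For graphs with no blank nodes, the triple-level difference is just set arithmetic, as in this sketch (my own illustration, not the cwm code). The tricky part I mentioned is everything this leaves out: blank nodes have no stable names, so two graphs can say the same thing and still differ as sets of triples.

```python
def diff(old, new):
    """Return the triples to delete from and insert into 'old' to get 'new'.
    Assumes ground triples only (no blank nodes)."""
    old, new = set(old), set(new)
    return {"deletions": old - new, "insertions": new - old}

def patch(graph, delta):
    """Apply a delta produced by diff() to a graph."""
    return (set(graph) - delta["deletions"]) | delta["insertions"]

before = {
    ("ex:acct1", "bank:balance", "4000"),
    ("ex:acct1", "bank:owner", "ex:dan"),
}
after = {
    ("ex:acct1", "bank:balance", "3575"),
    ("ex:acct1", "bank:owner", "ex:dan"),
}
delta = diff(before, after)
print(delta)
```

Updating a whole record at a time, as GData does, sidesteps the problem by making the record the unit of change.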
The GData approach to concurrency looks quite reasonable. I haven't studied the authentication mechanism. I hope to get to that presently.
Consensus and community review in open source and open standards
Consensus is a core value of W3C and lots of other open standards and open source communities. I used to think that a decision where almost everybody agreed except a few objectors was an example of consensus. That was based on my experience in the IETF, with its "rough consensus and running code" mantra. Then I learned that this is quite a stretch with respect to the normal dictionary meaning of "consensus".
The debian community seems to be examining the meaning of "consensus":
Many things are done on behalf of the project without every individual member supporting them - for instance, Mark is vigorously opposed to Debian UK being granted a trademark license, even though Branden (and therefore the project) granted one. The key difference here is the difference between consensus and unanimity.
Matthew Garrett 2006-04-04
Definitions of "consensus" vary. The wikipedia article on consensus has a good synthesis: Achieving consensus requires serious treatment of every group member's considered opinion.
W3C's consensus policy formally distinguishes the case of even one objection from consensus:
The following terms are used in this document to describe the level of support for a decision among a set of eligible individuals:
- Consensus: A substantial number of individuals in the set support the decision and nobody in the set registers a Formal Objection. Individuals in the set may abstain. Abstention is either an explicit expression of no opinion or silence by an individual in the set. Unanimity is the particular case of consensus where all individuals in the set support the decision (i.e., no individual in the set abstains).
- Dissent: At least one individual in the set registers a Formal Objection.
...
In some cases, even after careful consideration of all points of view, a group might find itself unable to reach consensus. The Chair may record a decision where there is dissent (i.e., there is at least one Formal Objection) so that the group may make progress (for example, to produce a deliverable in a timely manner). Dissenters cannot stop a group's work simply by saying that they cannot live with a decision. When the Chair believes that the Group has duly considered the legitimate concerns of dissenters as far as is possible and reasonable, the group should move on.
That last bit is important, since "you can't schedule consensus," another lesson I learned from Michael Sperberg-McQueen. And we do try to schedule our deliverables.
The RDF Data Access Working Group (DAWG) has been working on SPARQL for quite a while now. Our first public release was October 2004. Since then, we have handled comments from a few dozen people and tried to reach consensus with them. We weren't always successful. Our request for Candidate Recommendation shows the outstanding formal objections, each one of which got reviewed by The Director. Though W3C did grant that request for Candidate Recommendation status for SPARQL today (yay!), we need to go back over some of the comments and make test cases and maybe some clarifications. I hope that, in the process, we can address some of the concerns of those with formal objections and achieve consensus with them.
Also, I remember a time (though I can't confirm it from The Tao of IETF or any of the other records that I searched) when people and companies who wanted to deploy new technology on the Internet were expected to submit their proposal for community review before deploying widely. I wrote a message on squatting on link relationship names, x-tokens, registries, and URI-based extensibility to www-tag in April 2005, with concerns about several mechanisms which were deployed, some at giga-scale, as far as I can tell, without any community review. I think I'll repeat just about the whole thing:
When somebody wants to deploy a new idiom or a new term in the Web, they're more than welcome to make up a URI for it...
"[URI] is an agreement about how the Internet community allocates names and associates them with the resources they identify."
webarch
We particularly encourage this for XML vocabularies...
"The purpose of an XML namespace (defined in [XMLNS]) is to allow the deployment of XML vocabularies (in which element and attribute names are defined) in a global environment and to reduce the risk of name collisions in a given document when vocabularies are combined."
webarch
But while making up a URI is pretty straightforward, it's more trouble than not bothering at all. And people usually don't do any more work than they have to.
There is a time and a place for just using short strings, but since short strings are scarce resources shared by the global community, fair and open processes should be used to manage them. Witness TCP/IP ports, HTML element names, Unicode characters, and domain names and trademarks -- different processes, with different escalation and enforcement mechanisms, but all accepted as fair by the global community, more or less, I think.
The IETF has a tradition of reserving tokens starting with "x-" for experimental use, with the expectation that they'll shed the x- prefix as they're registered by IANA. But it's not really clear how that transition happens.
Witness application/x-www-form-urlencoded. A horrible name, perhaps, but nobody has enough motivation to change it. It's been all the way thru the W3C process... twice now: once for HTML 4 and again in XForms. Hmm... I wonder if it's registered... nope.
A pattern that I'd like to see more of is
- start with a URI for a new term
- if it picks up steam, introduce a synonym that is a short string thru a fair/open process
I'm not sure where the motivation to complete step 2 will come from, but if it doesn't come at all, that's OK. Stopping with a URI term is a lot better than getting stuck with something like x-www-form-urlencoded.
Lately I'm seeing quite the opposite. The HTML specification includes a hook for grounding link relationships in URI space, but people aren't using it:
when Google sees the attribute (rel="nofollow") on hyperlinks, those links won't get any credit when we rank websites in our search results.
google Jan 2005 announcement
By adding rel="tag" to a hyperlink, a page indicates that the destination of that hyperlink is an author-designated "tag" (or keyword/subject) of the current page.
technorati RelTag
What are the prefetching hints?
The browser looks for either an HTML <link> tag or an HTTP Link: header with a relation type of either next or prefetch.
mozilla prefetching FAQ
Google is sufficiently influential that they form a critical mass for deploying these things all by themselves. While Google enjoys a good reputation these days, and the community isn't complaining much, I don't think what they're doing is fair. Other companies with similarly influential positions used to play this game with HTML element names, and I think the community has decided that it's not fair or even much fun.
Deployment of the technorati RelTag thingy seems much more grass-roots, peer-to-peer. But even so, it's only a matter of time before we see a name clash. So perhaps it's fair, but it doesn't seem wise.
I think all three of these are cases of squatting on the community resource of link relationship names.
Should all new link relationships go thru the W3C HTML Working Group? No, of course not. The profile mechanism is there to decentralize the process.
Should W3C run a registry of link relationship names? That seems boring and inefficient, to me. It can't possibly cost less time and effort to apply for a W3C-registered link relationship name than it can to reserve a domain name and run a web server, can it?
If Google and Mozilla really want the community to agree to these short names, I'd be happy to see them use the W3C member submissions process.
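As an illustration of what grounding link relationships in URI space via the profile mechanism could look like, here is a sketch under one possible convention: a short rel token is read as the profile URI plus a fragment, and a token that is already an absolute URI is left alone. HTML 4 does not mandate this mapping; the convention, the function name, and the profile URI are all assumptions for the sake of the example.

```python
from urllib.parse import urlsplit

def rel_uris(rel_attr, profile):
    """Expand the space-separated tokens of a rel attribute to URIs.
    Assumed convention: short token -> profile + '#' + token."""
    uris = []
    for token in rel_attr.split():
        if urlsplit(token).scheme:
            uris.append(token)  # already an absolute URI
        else:
            # rel values are case-insensitive, so normalize to lower case
            uris.append(profile + "#" + token.lower())
    return uris

print(rel_uris("NoFollow tag", "http://profile.example/links"))
print(rel_uris("http://vocab.example/rel/prefetch", "http://profile.example/links"))
```

Under such a convention two communities could both use the short string "tag" without clashing, as long as their pages cite different profiles.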
using JSON and templates to produce microformat data
In Getting my Personal Finance data back with hCalendar and hCard, I discussed using JSON-style records as an intermediate structure between tab-separated transaction report data and hCalendar. I just took it a step further in palmagent; hipsrv.py uses kid templates, so the markup can be tweaked independently of the normalization and SPARQL-like filtering logic. I expect to be able to do RDF/XML output thru templates too.
Working at the JSON level is nice; when I want to make a list of 3 numbers, I can just do that, unlike in XML where I have to make up names and think about whether to use a space-separated microparsed attribute value or a massively redundant element structure.
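A minimal sketch of that pipeline: a tab-separated line becomes a JSON-style record (a plain dict), and a template turns the record into hCalendar-flavored markup. The column layout is invented for the example, and `string.Template` from the standard library stands in for kid here; hipsrv.py itself is not shown.

```python
import json
from html import escape
from string import Template

FIELDS = ["date", "payee", "amount"]  # hypothetical column layout

def record(line):
    """Turn one tab-separated line into a JSON-style record."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

# The markup lives in the template, separate from the parsing logic.
ROW = Template(
    '<tr class="vevent">'
    '<td><abbr class="dtstart" title="$date">$date</abbr></td>'
    '<td class="summary">$payee</td><td>$amount</td></tr>'
)

def render(rec):
    """Fill the template, escaping each field for HTML."""
    return ROW.substitute({k: escape(v) for k, v in rec.items()})

rec = record("2000-09-20\tSEPTEMBERS STEAKHOUSE\t29.33\n")
print(json.dumps(rec))
print(json.dumps([1, 2, 3]))  # a list of 3 numbers, no names to make up
print(render(rec))
```

Swapping in a different template (RDF/XML, say) would not touch `record` at all, which is the point of the intermediate structure.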
It brings me back to my March 1997 essay for the Web Apps issue on Distributed Objects and noodling on VHLL types in ILU.
Getting my Personal Finance data back with hCalendar and hCard
The Quicken Interchange Format (QIF) is notoriously inadequate for clean import/export. The instructions for migrating Quicken data across platforms say:
- From the old platform, dump it out as QIF
- On the new platform, read in the QIF data
- After importing the file, verify that account balances in your new Quicken for Mac 2004 data file are the same as those in Quicken for Windows. If they don't match, look for duplicate or missing transactions.
I have not migrated my data from Windows98 to OS X because of this mess. I use win4lin on my debian linux box as life-support for Quicken 2001.
Meanwhile, Quicken supports printing any report to a tab-separated file, and I found that an exhaustive transaction report represents transfers unambiguously. Since October 2000, when my testing showed that I could re-create various balances and reports from these tab-separated reports, I have been maintaining a CVS history of my exported Quicken data, splitting it every few years:
$ wc *qtrx.txt
4785 38141 276520 1990-1996qtrx.txt
6193 61973 432107 1997-1999qtrx.txt
4307 46419 335592 2000qtrx.txt
5063 54562 396610 2002qtrx.txt
5748 59941 437710 2004qtrx.txt
26096 261036 1878539 total
I started a little module on dev.w3.org... I call it Quacken currently, but I think I'm going to have to rename it for trademark reasons. I started with normalizeQData.py to load the data into postgres for use with saCASH, but then saCASH went Java/Struts and all way before debian supported Java well enough for me to follow along. Without a way to run them in parallel and sync back and forth, it was a losing proposition anyway.
Then I managed to export the data to the web by first converting it to RDF/XML:
qtrx93.rdf: $(TXTFILES)
$(PYTHON) $(QUACKEN)/grokTrx.py $(TXTFILES) >$@
... and then using searchTrx.xsl (inside a trivial CGI script) that puts up a search form, looks for the relevant transactions, and returns them as XHTML. I have done a few other reports with XSLT; nothing remarkable, but enough that I'm pretty confident I could reproduce all the reports I use from Quicken. But the auto-fill feature is critical, and I didn't see a way to do that.
Then came google suggest and ajax. I'd really like to do an ajax version of Quicken.
I switched the data from CVS to mercurial a few months ago, carrying the history over. I seem to have 189 commits/changesets, of which 154 are on the qtrx files (others are on the makefile and related scripts). So that's about one commit every two weeks.
Mercurial makes it easy to keep the whole 10 year data set, with all the history, in sync on several different computers. So I had it all with me on the flight home from the W3C Tech Plenary in France, where we did a microformats panel. Say... transactions are events, right? And payee info is kinda like hCard...
So I factored out the parts of grokTrx.py that do the TSV file handling (trxtsv.py) and wrote an hCalendar output module (trxht.py).
I also added some SPARQL-ish filtering, so you can do:
python trxht.py --account 'MIT 2000' --class 200009xml-ny 2000qtrx.txt
And get a little microformat expense report:
9/20/00 SEPTEMBERS STEAKHOUSE ELMSFORD NY MIT 2000 19:19 c [Citi Visa HI]/200009xml-ny 29.33
9/22/00 RAMADA INNS ELMSFORD GR ELMSFORD NY MIT 2000 3 nights c [Citi Visa HI]/200009xml-ny 603.96
9/24/00 AVIS RENT-A-CAR 1 WHITE PLAINS NY MIT 2000 c [Citi Visa HI]/200009xml-ny 334.45
1/16/01 MIT MIT 2000 MIT check # 20157686 dated 12/28/00 c [Intrust Checking]/200009xml-ny -967.74
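The filtering behind those --account and --class options amounts to something like the following sketch. The record layout (a dict per transaction with an account and a list of splits, each split carrying a class) is my shorthand for this post, not the actual trxtsv.py structure.

```python
def matches(trx, account=None, cls=None):
    """True if the transaction passes every filter that was given."""
    if account is not None and trx["acct"] != account:
        return False
    if cls is not None and not any(s.get("class") == cls for s in trx["splits"]):
        return False
    return True

def each_match(trxs, **filters):
    """Lazily yield the transactions that pass the filters."""
    return (t for t in trxs if matches(t, **filters))

# Hypothetical sample data, loosely modeled on the report above.
trxs = [
    {"date": "9/20/00", "payee": "SEPTEMBERS STEAKHOUSE", "acct": "MIT 2000",
     "splits": [{"cat": "[Citi Visa HI]", "class": "200009xml-ny", "amount": 29.33}]},
    {"date": "9/21/00", "payee": "GROCERY", "acct": "Checking",
     "splits": [{"cat": "Food", "amount": 41.10}]},
]
hits = list(each_match(trxs, account="MIT 2000", cls="200009xml-ny"))
print([t["payee"] for t in hits])  # ['SEPTEMBERS STEAKHOUSE']
```

Each filter is a pattern over the record, which is why it feels SPARQL-ish: the class filter is a match on any split of the transaction.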
Mercurial totally revolutionizes coding on a plane. There's no way I would have been as productive if I couldn't commit and diff and such right there on the plane. I'm back to using CVS for the project now, in order to share it over the net, since I don't have mercurial hosting figured out just yet. But here's the log of what I did on the plane:
changeset: 19:d1981dd8e140
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 20:48:44 2006 -0600
summary: playing around with places
changeset: 18:9d2f0073853b
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 18:21:35 2006 -0600
summary: fixed filter arg reporting
changeset: 17:3993a333747b
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 18:10:10 2006 -0600
summary: more dict work; filters working
changeset: 16:59234a4caeae
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 17:30:28 2006 -0600
summary: moved trx structure to dict
changeset: 15:425aab9bcc52
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 20:57:17 2006 +0100
summary: vcards for payess with phone numbers, states
changeset: 14:cbd30e67647a
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 19:12:38 2006 +0100
summary: filter by trx acct
changeset: 13:9a2b49bc3303
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 18:45:06 2006 +0100
summary: explain the filter in the report
changeset: 12:2ea13bafc379
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 18:36:09 2006 +0100
summary: class filtering option
changeset: 11:a8f550c8759b
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 18:24:45 2006 +0100
summary: filtering in eachFile; ClassFilter
changeset: 10:acac37293fdd
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 17:53:18 2006 +0100
summary: moved trx/splits fixing into eachTrx in the course of documenting trxtsv.py
changeset: 9:5226429e9ef6
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 17:28:01 2006 +0100
summary: clarify eachTrx with another test
changeset: 8:afd14f2aa895
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 17:19:36 2006 +0100
summary: replaced fp style grokTransactions with iter style eachTrx
changeset: 7:eb020cda1e67
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 16:16:43 2006 +0100
summary: move isoDate down with field routines
changeset: 6:123f66ac79ed
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 16:14:45 2006 +0100
summary: tweak docs; noodle on CVS/hg scm stuff
changeset: 5:4f7ca3041f9a
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 16:04:07 2006 +0100
summary: split trxtsv and trxht out of grokTrx
changeset: 4:95366c104b42
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 14:48:04 2006 +0100
summary: idea dump
changeset: 3:62057f582298
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 09:55:48 2006 +0100
summary: handle S in num field
changeset: 2:0c23921d0dd3
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 09:38:54 2006 +0100
summary: keep tables bounded; even/odd days
changeset: 1:031b9758304c
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 09:19:05 2006 +0100
summary: table formatting. time to land
changeset: 0:2d515c48130b
user: Dan Connolly <connolly@w3.org>
date: Sat Mar 4 07:55:58 2006 +0100
summary: working on plane
I used doctest unit testing quite a bit, and rst for documentation:
Usage
Run a transaction report over all of your data in some date range and print it to a tab-separated file, say, 2004qtrx.txt. Then invoke a la:
$ python trxht.py 2004qtrx.txt >,x.html
$ xmlwf ,x.html
$ firefox ,x.html
You can give multiple files, as long as the ending balance of one matches the starting balance of the next:
$ python trxht.py 2002qtrx.txt 2004qtrx.txt >,x.html
Support for SPARQL-style filtering is in progress. Try:
$ python trxht.py --class myclass myqtrx.txt >myclass-transactions.html
to simulate:
describe ?TRX where { ?TRX qt:split [ qs:class "9912mit-misc"] }.
Future Work
- add hCards for payees (in progress)
- pick out phone numbers, city/state names
- support a form of payee smushing on label
- make URIs for accounts, categories, classes, payees
- support round-trip with QIF; sync back up with RDF export work in grokTrx.py
- move the quacken project to mercurial
- proxy via dig.csail.mit.edu or w3.org? both?
- run hg serve on homer? swada? login.csail?
- publish hg log stuff in a _scm/ subpath; serve the current version at the top
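For a flavor of the doctest style mentioned above, here is a small date-conversion routine with its tests in the docstring. It is a reconstruction for illustration, not the isoDate code from the project; the two-digit-year pivot of 70 is my assumption.

```python
import doctest

def iso_date(mdy, pivot=70):
    """Convert a Quicken-style m/d/yy date to ISO 8601.

    Two-digit years at or above the pivot are read as 19xx,
    the rest as 20xx (an assumption of this sketch).

    >>> iso_date("9/20/00")
    '2000-09-20'
    >>> iso_date("1/16/01")
    '2001-01-16'
    >>> iso_date("12/31/96")
    '1996-12-31'
    """
    m, d, y = (int(part) for part in mdy.split("/"))
    year = (1900 if y >= pivot else 2000) + y
    return "%04d-%02d-%02d" % (year, m, d)

if __name__ == "__main__":
    doctest.testmod()  # silent when every example passes
```

The examples double as documentation, which is what makes this style pleasant to write on a plane.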
Reflections on the W3C Technical Plenary week
The last item on the agenda of the TAG meeting in France was "Reviewing what we have learned during a full week of meetings". I proposed that we do it on the beach, and it carried.
By then, the network woes of Monday and Tuesday had largely faded from memory.
I was on two of the plenary day panels. Tantek reports on one of them: Microformats voted best session at W3C Technical Plenary Day!. My presentation in that panel was on GRDDL and microformats. Jim Melton followed with his SPARQL/SQL/XQuery talk. Between the two of them, Noah Mendelsohn said he thought the Semantic Web might just be turning a corner.
My other panel presentation was Feedback loops and formal systems where I talked about UML and OWL after touching on the contrast between symbolic approaches like the Semantic Web and statistical approaches like pagerank. Folksonomies are an interesting mixture of both, I suppose. Alistair took me to task for being sloppy with the term "chaotic system"; he's quite right that complex system is the more appropriate description of the Web.
The TAG discussion of that session started with jokes about how formal systems is soporific enough without putting it right after a big French lunch. TimBL mentioned the scheme denotational semantics, and TV said that Jonathan Rees is now at Creative Commons. News to me. I spent many, many hours poring over his scheme48 code a few years back. I don't think I knew where the name came from until today: Within 48 hours we had designed and implemented a working Scheme, including read, write, byte code compiler, byte code interpreter, garbage collector, and initialization logic.
The SemWeb IG meeting on Thursday was full of fun lightning talks and cool hacks. I led a GRDDL discussion that went well, I think. The SPARQL calendar demo rocked. Great last-minute coding, Lee and Elias!
There and back again
On the return leg of my itinerary, the captain announced the cruising altitude, as usual, and then added ... which means you'll spend most of today 6 miles above the earth.
My travel checklist worked pretty well, with a few exceptions. The postcard thing isn't a habit yet. I forgot a paperback book; that was OK since I slept quite a bit on the way over and I got into the coding zone on the way back; more about that later, I hope.
Other Reflections
See also reflections by:
... and stay tuned for something from
See also: Flickr photo group, NCE bookmarks
Toward Semantic Web data from Wikipedia
When I heard about Wikimania 2006 in August in Boston, I put it on my travel schedule, at least tentatively.
Then I had an idea...
Wikipedia:Infobox where the data lives in wikipedia. sparql, anyone? or grddl?
my bookmarks, 2006-02-16
Then I put the idea in a wishlist slide in my presentation on microformats and GRDDL at the W3C technical plenary last week.
The next day, in the SemWeb IG meeting, I met Markus Krötzsch and at lunch I learned he's working on Semantic MediaWiki, a project to do just what I'm hoping for. From our discussion, I think this could work out really well.
For reference, he's 3rd from the left in a photo from wikimania 2005.
I use wikipedia quite regularly to look up airport codes, latitudes, longitudes, lists of postal codes, and the like; boy would I love to have it all in RDF... maybe using GRDDL on the individual pages, maybe a SPARQL interface from their DB... maybe both.
Hmm... the RDF export of their San Diego demo page seems to conflate pages with topics of pages. I guess I should file a bug.
bnf2turtle -- write a turtle version of an EBNF grammar
In order to cross one of the few remaining t's on the SPARQL spec, I wrote bnf2turtle.py today. It turned out to be such a nice piece of code that I elaborated the module documentation using ReStructuredText. It's checked into the SPARQL spec editor's draft materials, but I'll probably move it to the swap codebase presently. Meanwhile, here's the formatted version of the documentation:
Author: Dan Connolly
Version: $Revision: 1.13 $ of 2006-02-10
Copyright: W3C Open Source License. Share and enjoy.
Usage
Invoke a la:
  python bnf2turtle.py foo.bnf >foo.ttl
where foo.bnf is full of lines like:
  [1] document ::= prolog element Misc*
as per the XML formal grammar notation. The output is Turtle - Terse RDF Triple Language:
  :document rdfs:label "document"; rdf:value "1";
    rdfs:comment "[1] document ::= prolog element Misc*";
    a g:NonTerminal;
    g:seq ( :prolog :element [ g:star :Misc ] ) .
Motivation
Many specifications include grammars that look formal but are not actually checked, by machine, against test data sets. Debugging the grammar in the XML specification has been a long, tedious manual process. Only when the loop is closed between a fully formal grammar and a large test data set can we be confident that we have an accurate specification of a language [1].
The grammar in the N3 design note has evolved based on the original manual transcription into a python recursive-descent parser and subsequent development of test cases. Rather than maintain the grammar and the parser independently, our goal is to formalize the language syntax sufficiently to replace the manual implementation with one derived mechanically from the specification.
[1] and even then, only the syntax of the language.
Related Work
Sean Palmer's n3p announcement demonstrated the feasibility of the approach, though that work did not cover some aspects of N3.
In development of the SPARQL specification, Eric Prud'hommeaux developed Yacker, which converts EBNF syntax to perl and C and C++ yacc grammars. It includes an interactive facility for checking strings against the resulting grammars. Yosi Scharf used it in cwm Release 1.1.0rc1, which includes a SPARQL parser that is almost completely mechanically generated.
The N3/turtle output from yacker is lower level than the EBNF notation from the XML specification; it has the ?, +, and * operators compiled down to pure context-free rules, obscuring the grammar structure. Since that transformation is straightforwardly expressed in semantic web rules (see bnf-rules.n3), it seems best to keep the RDF expression of the grammar in terms of the higher level EBNF constructs.
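To make "compiled down to pure context-free rules" concrete, here is a minimal Python sketch of that kind of desugaring. It is an illustration only: it is neither yacker's code nor the rules in bnf-rules.n3, and the dictionary representation and the helper nonterminal names (such as `_Misc_star_1`) are assumptions made up for this example.

```python
import itertools


def desugar(rules):
    """Rewrite rules that use star/plus/opt into plain context-free rules.

    `rules` maps a nonterminal name to a list of alternatives; each
    alternative is a list of symbols. A symbol is either a name (str) or a
    pair (op, name) with op in {"star", "plus", "opt"}.
    The representation and helper names are hypothetical.
    """
    out = {}
    counter = itertools.count(1)

    for lhs, alts in rules.items():
        new_alts = []
        for alt in alts:
            new_alt = []
            for sym in alt:
                if isinstance(sym, str):
                    new_alt.append(sym)
                    continue
                op, name = sym
                helper = "_%s_%s_%d" % (name, op, next(counter))
                if op == "opt":       # X?  =>  H ::= <empty> | X
                    out[helper] = [[], [name]]
                elif op == "star":    # X*  =>  H ::= <empty> | X H
                    out[helper] = [[], [name, helper]]
                elif op == "plus":    # X+  =>  H ::= X | X H
                    out[helper] = [[name], [name, helper]]
                else:
                    raise ValueError("unknown operator %r" % op)
                new_alt.append(helper)
            new_alts.append(new_alt)
        out[lhs] = new_alts
    return out


if __name__ == "__main__":
    grammar = {"document": [["prolog", "element", ("star", "Misc")]]}
    for name, alts in desugar(grammar).items():
        print(name, "::=", " | ".join(" ".join(a) or "<empty>" for a in alts))
```

After this rewrite the `document` rule refers to an invented helper nonterminal, and the fact that the original said `Misc*` is no longer visible in the rule itself, which is the loss of structure described above.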
Open Issues and Future Work
The yacker output also has the terminals compiled to elaborate regular expressions. The best strategy for dealing with lexical tokens is not yet clear. Many tokens in SPARQL are case insensitive; this is not yet captured formally.
The schema for the EBNF vocabulary used here (g:seq, g:alt, ...) is not yet published; it should be aligned with swap/grammar/bnf and the bnf2html.n3 rules (and/or the style of linked XHTML grammar in the SPARQL and XML specifications).
It would be interesting to corroborate the claim in the SPARQL spec that the grammar is LL(1) with a mechanical proof based on N3 rules.
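For readers who want a feel for what such a check involves, here is a rough Python sketch of the textbook LL(1) test: compute FIRST and FOLLOW sets, then look for a nonterminal with two alternatives that can start on the same token. It is not a proof in N3 rules and has not been run against the SPARQL grammar; the function name, the dictionary representation, and the toy grammars are assumptions for illustration.

```python
EPS, END = "", "$"   # markers for the empty string and end of input


def ll1_conflicts(grammar, start):
    """Return (nonterminal, token) pairs that violate the LL(1) condition.

    `grammar` maps each nonterminal to a list of alternatives, each a list
    of symbols; any symbol that is not a key of `grammar` is a terminal.
    """
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)

    def first_of(seq):
        # FIRST set of a sequence of symbols, using the current estimates.
        result = set()
        for sym in seq:
            f = first[sym] if sym in grammar else {sym}
            result |= f - {EPS}
            if EPS not in f:
                return result
        result.add(EPS)
        return result

    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                f = first_of(alt)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue
                    rest = first_of(alt[i + 1:])
                    add = rest - {EPS}
                    if EPS in rest:
                        add |= follow[nt]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True

    conflicts = []
    for nt, alts in grammar.items():
        seen = set()
        for alt in alts:
            f = first_of(alt)
            predict = (f - {EPS}) | (follow[nt] if EPS in f else set())
            conflicts.extend((nt, tok) for tok in sorted(predict & seen))
            seen |= predict
    return conflicts


if __name__ == "__main__":
    ok = {"document": [["prolog", "element", "Miscs"]],
          "Miscs": [[], ["Misc", "Miscs"]]}
    bad = {"E": [["a"], ["a", "b"]]}
    print(ll1_conflicts(ok, "document"))
    print(ll1_conflicts(bad, "E"))
```

An empty result for a grammar means no two alternatives of any rule compete for the same lookahead token, which is what the LL(1) claim amounts to once the EBNF operators have been expanded.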
Background
The N3 Primer by Tim Berners-Lee introduces RDF and the Semantic web using N3, a teaching and scribbling language. Turtle is a subset of N3 that maps directly to (and from) the standard XML syntax for RDF.
I started with a kludged and broken algorithm for handling the precedence of | vs concatenation in EBNF rules; for a moment I thought the task required a yacc-like LR parser, but then I realized recursive descent would work well enough. A dozen or so doctests later, it did indeed work. I haven't checked the resulting grammar against the SPARQL tests yet, but it sure looks right.
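The precedence point can be sketched in a few dozen lines of Python. This is a simplified illustration, not the code in bnf2turtle.py: it handles only names, parentheses, `|`, and the `*`, `+`, `?` suffixes, and the function names are invented here. The `g:seq`, `g:alt`, and `g:star` terms come from the examples above; `g:plus` and `g:opt` are assumed by analogy.

```python
import re

TOKEN = re.compile(r"\s*(?:([A-Za-z_][A-Za-z0-9_]*)|([()|*+?]))")
SUFFIX = {"*": "star", "+": "plus", "?": "opt"}


def tokenize(text):
    """Split an EBNF right-hand side into names and punctuation."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError("unexpected input at %r" % text[pos:])
        tokens.append(m.group(1) or m.group(2))
        pos = m.end()
    return tokens


def parse_alt(tokens, i):
    """alt ::= seq ('|' seq)*    ('|' binds loosest)"""
    node, i = parse_seq(tokens, i)
    branches = [node]
    while i < len(tokens) and tokens[i] == "|":
        node, i = parse_seq(tokens, i + 1)
        branches.append(node)
    return (branches[0] if len(branches) == 1 else ("alt", branches)), i


def parse_seq(tokens, i):
    """seq ::= item+    (concatenation binds tighter than '|')"""
    items = []
    while i < len(tokens) and tokens[i] not in ("|", ")"):
        node, i = parse_item(tokens, i)
        items.append(node)
    if not items:
        raise SyntaxError("empty sequence")
    return (items[0] if len(items) == 1 else ("seq", items)), i


def parse_item(tokens, i):
    """item ::= (NAME | '(' alt ')') ('*' | '+' | '?')?"""
    tok = tokens[i]
    if tok == "(":
        node, i = parse_alt(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("missing ')'")
        i += 1
    elif tok in SUFFIX:
        raise SyntaxError("operator %r has no operand" % tok)
    else:
        node, i = ("ref", tok), i + 1
    if i < len(tokens) and tokens[i] in SUFFIX:
        node, i = (SUFFIX[tokens[i]], node), i + 1
    return node, i


def to_turtle(node):
    """Render a parse tree as a Turtle property/object pair."""
    if node[0] == "ref":
        return ":" + node[1]
    if node[0] in ("seq", "alt"):
        return "g:%s ( %s )" % (node[0], " ".join(operand(n) for n in node[1]))
    return "g:%s %s" % (node[0], operand(node[1]))


def operand(node):
    """Names appear bare; compound expressions become blank nodes."""
    return to_turtle(node) if node[0] == "ref" else "[ %s ]" % to_turtle(node)


def rhs_to_turtle(text):
    tokens = tokenize(text)
    node, i = parse_alt(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("unexpected %r" % tokens[i])
    if node[0] == "ref":          # a lone name is treated as a 1-item sequence
        node = ("seq", [node])
    return to_turtle(node)


if __name__ == "__main__":
    print(rhs_to_turtle("prolog element Misc*"))
    # prints: g:seq ( :prolog :element [ g:star :Misc ] )
```

Because `parse_alt` calls `parse_seq` for each branch, `a b | c` groups as `(a b) | c` with no precedence table: the nesting of the functions is the precedence.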
Then I wondered how much of the formal grammar notation from the XML spec I hadn't covered, so I tried it out on the XML grammar (after writing a 20 line XSLT transformation to extract the grammar from the XML REC) and it worked the first time! So I think it's reasonably complete, though it has a few details that are hard-coded to SPARQL.
See also: cwm-talk discussion, swig chump entry.


