Modelling HTTP cache configuration in the Semantic Web

Submitted by connolly on Fri, 2006-12-22 19:10. ::

The W3C Semantic Web Interest Group is considering URI best practices, whether to use LSIDs or HTTP URIs, etc. I ran into some of them at MIT last week. At first it sounded like they wanted some solution so general it would solve the only two hard things in Computer Science: cache invalidation and naming things, as Phil Karlton would say. But then we started talking about a pretty interesting approach: using the semantic web to model cache configuration. It has long been a thorn in my side that there is no standard/portable equivalent to .htaccess files, no RDF schema for HTTP and MIME, etc.

At WWW9 in May 2000, I gave a talk on formalizing HTTP caching. Where I used larch there, I'd use RDF, OWL, and N3 rules, today. I made some progress in that direction in August 2000: An RDF Model for GET/PUT and Document Management.

Web Architecture: Protocols for State Distribution is a draft I worked on around 1996 to 1999 without ever really finishing it.

I can't find Norm Walsh's item on wwwoffle config, but I did find his XML 2003 paper Caching in with Resolvers:

This paper discusses entity resolvers, caches, and other strategies for dealing with access to sporadically available resources. Our principle focus is on XML Catalogs and local proxy caches. We’ll also consider in passing the ongoing debate of names and addresses, most often arising in the context of URNs vs. URLs.

In Nov 2003 I worked on Web Architecture Illustrated with RDF diagramming tools.

The tabulator, as it's doing HTTP, propagates stuff like content type, last modified, etc. from javascript into its RDF store. Meanwhile, the accessibility evaluation and repair folks just released HTTP Vocabulary in RDF. I haven't managed to compare the tabulator's vocabulary with that one yet. I hope somebody does soon.

And while we're doing this little survey, check out the Uri Template stuff by Joe Gregorio and company. I haven't taken a very close look yet, but I suspect it'll be useful for various problems, if not this one in particular.

Is it now illegal to link to copyrighted material in Australia? NO

Submitted by Danny Weitzner on Wed, 2006-12-20 12:26. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

There’s been a lot of coverage (Sydney Morning Herald, Copyright ruling puts hyperlinking on notice, 19 December 2006) about a recent copyright case from the Federal Court of Australia. This is an important case, but on my reading of the decision itself, it’s a mistake to see it as a general rule against linking to copyrighted material, as some of the press coverage suggests. Of course, it would cripple the Web if it became illegal to merely link to copyrighted material. As virtually all Web pages are copyrighted by someone, a rule that any link is an invitation to engage in copyright violation would mean one could only link to pages with permission. That would, indeed, break the Web.

But that is not what this case seems to say. From an admittedly cursory reading of the opinion, the Australian court seems to have tied its decision to the fact that:

“…it was the deliberate choice of Mr Cooper to establish and maintain his website in a form which did not give him the power immediately to prevent, or immediately to restrict, internet users from using links on his website to access remote websites for the purpose of copying sound recordings in which copyright subsisted.” (41)*

and the court went on to accept the trial court’s finding that:

“… Mr Cooper [the defendant and operator of site] benefited financially from sponsorship and advertisements on the website; that is, that the relationship between Mr Cooper and the users of his website had a commercial aspect. Mr Cooper’s benefits from advertising and sponsorship may be assumed to have been related to the actual or expected exposure of the website to internet users. As a consequence Mr Cooper had a commercial interest in attracting users to his website for the purpose of copying digital music files.” (48)

To boil it down: though Cooper didn’t actually have the power to stop people from illegally copying the MP3 files to which he provided links, his intent was that people engage in copying he knew to be illegal, and he actually benefited from that behavior.

The court also addressed the defendant’s argument that a ruling against him could also outlaw search engines in Australia. The court said: “Google is a general purpose search engine rather than a website designed to facilitate the downloading of music files.”

Copyright law has developed elaborate doctrine in order to try to determine when to punish those who have some role in enabling infringement as opposed to those who are the actual infringers. I’m not sure that that balance is always right, but this case, similar to the US Supreme Court case MGM v. Grokster, is an effort to find a way to indicate when linking to copyrighted material goes beyond building the Web and violates the law. I’m not always happy about where that line is drawn, but it’s a lot more subtle than the simple technical question of whether a link is provided or not.

* note that the Australian courts have adopted the enlightened practice of using paragraph numbers to refer inside an opinion, rather than relying on page numbers, which work poorly with digital copies (such as web pages that lack pagination) and give certain legal publishers undue control over search/retrieval services for legal documents.

Drupal upgrade

Submitted by ryanlee on Fri, 2006-12-01 21:15. ::

I've upgraded the Drupal installation so we can use the OpenID module. A few things learned in the process:

  • The Drupal upgrade path requires incremental steps; to go from one minor version to another two numbers away means upgrading through every intermediate minor version until you reach the target. Earlier versions aren't good at 'knowing' which data model version the system is at, so the upgrade meant importing / guessing / dropping and repeating the cycle until I hit on the right one, which in itself was not an easy state to assess.
  • The module administration page loads every module, which can cause memory issues resulting in a blank page. Removing unnecessary, unused modules helps.
  • The JanRain OpenID 1.2.0 PEAR installation fails to install itself properly, requiring directories to be moved post-install.
  • The OpenID module does not respect settings on account creation. I wrote some code to fix this.

But we're now at the latest Drupal version; and more on OpenID later.

Addendum: And apparently I needed to enable the legacy module. Thanks to those who pointed out the symptoms. 'I' in this case is solely Ryan, not Tim, Dan, or anybody else; send any of your issues with the upgrade my way.

A new Basketball season brings a new episode in the personal information disaster

Submitted by connolly on Thu, 2006-11-16 12:39. ::

Basketball season is here. Time to copy my son's schedule to my PDA. The organization that runs the league has their schedules online (yay!) in HTML (yay!). But with events separated by <br>s rather than enclosed in elements (whimper). Even after running it thru tidy, it looks like:

<br />
<b>Event Date:</b> Wednesday, 11/15/2006<br />
<b>Start Time:</b> 8:15<br />
<b>End Time:</b> 9:30<br />
<br />
<b>Event Date:</b> Wednesday, 11/8/2006<br />
<b>Start Time:</b> 8:15<br />

So much for XSLT. Time for a nasty perl hack.

Or maybe not. Between my 'no more undocumented, untested code' new year's resolution and the maturity of the python libraries, my usual doctest-driven development worked fine; I was able to generate JSON-shaped structures without hitting that "oh screw it; I'll just use perl" point; the gist of the code is:

import sys
import re
import kid  # third-party XML templating library

def main(argv):
    dataf, tplf = argv[1], argv[2]
    tpl = kid.Template(file=tplf) = eachEvent(file(dataf))  # template parameter name assumed

    for s in tpl.generate(output='xml', encoding='utf-8'):

def eachEvent(lines):
    """turn an iterator over lines into an iterator over events"""
    e = {}
    for l in lines:
        if 'Last Name' in l:
            surname = findName(l)  # findName, findDate elided from this gist
            e = mkobj("practice", "Practice w/%s" % surname)
        elif 'Event Date' in l:
            if 'dtstart' in e:
                yield e
                e = mkobj("practice", "Practice w/%s" % surname)
            e['date'] = findDate(l)
        elif 'Start Time' in l:
            e['dtstart'] = e['date'] + "T" + findTime(l)
        elif 'End Time' in l:
            e['dtend'] = e['date'] + "T" + findTime(l)
    if 'dtstart' in e:
        yield e  # don't drop the last event

next = 0
def mkobj(pfx, summary):
    global next
    next += 1
    return {'id': "%s%d" % (pfx, next),
            'summary': summary,

def findTime(s):
    >>> findTime("<b>Start Time:</b> 8:15<br />")
    >>> findTime("<b>End Time:</b> 9:30<br />")
    m = r"(\d+):(\d+)", s)
    hh, mm = int(, int(
    return "%02d:%02d:00" % (hh + 12, mm)


It uses my palmagent hackery: event-rdf.kid to produce RDF/XML which I can upload to my PDA. I also used the event.kid template to generate an hCalendar/XHTML version for archival purposes, though I didn't use that directly to feed my PDA.

The development took half an hour or so squeezed into this morning:

changeset:   5:7d455f25b0cc
user:        Dan Connolly
date:        Thu Nov 16 11:31:07 2006 -0600
summary:     id, seconds in time, etc.

changeset:   2:2b38765cec0f
user:        Dan Connolly
date:        Thu Nov 16 09:23:15 2006 -0600
summary:     finds date, dtstart, dtend, and location of each event

changeset:   1:e208314f21b2
user:        Dan Connolly
date:        Thu Nov 16 09:08:01 2006 -0600
summary:     finds dates of each event

Celebrating OWL interoperability and spec quality

Submitted by connolly on Sat, 2006-11-11 00:29. ::

In a Standards and Pseudo Standards item in July, Holger Knublauch gripes that SQL interoperability is still tricky after all these years, and UML is still shaking out bugs, while RDF and OWL are really solid. I hope GRDDL and SPARQL will get there soon too.

At the OWL: Experiences and Directions workshop in Athens today, as the community gathered to talk about problems they see with OWL and what they'd like to add to OWL, I felt compelled to point out (using a few slides) that:

  • XML interoperability is quite good and tools are pretty much ubiquitous, but don't forget the XML Core working group has fixed over 100 errata in the specifications since they were originally adopted in 1998.
  • HTML interoperability is a black art; the specification is only a small part of what you need to know to build interoperable tools.
  • XML Schema interoperability is improving, but interoperability problem reports are still fairly common, and it's not always clear from the spec which tool is right when they disagree.

And while the OWL errata do include a repeated sentence and a missing word, there have been no substantive problems reported in the normative specifications.

How did we do that? The OWL deliverables include:

OWL test results screenshot

Jeremy and Jos did great work on the tests. And Sandro's approach to getting test results back from the tool developers was particularly inspired. He asked them to publish their test results as RDF data in the web. Then he provided immediate feedback in the form of an aggregate report that included updates live. After our table of test results had columns from one or two tools, several other developers came out of the woodwork and said "here are my results too." Before long we had results from a dozen or so tools and our implementation report was compelling.
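The aggregation step Sandro automated can be sketched in a few lines: collect each tool's published results, then pivot them into a test-by-tool table. The tool and test names below are invented for illustration; the real pipeline consumed RDF data from the web rather than an in-memory dict.

```python
# Hypothetical per-tool results, as they might be extracted from each
# developer's published RDF (names are made up for this sketch).
results = {
    "ToolA": {"owl-test-001": "pass", "owl-test-002": "pass"},
    "ToolB": {"owl-test-001": "pass", "owl-test-002": "fail"},
}

def aggregate(results):
    """Pivot {tool: {test: outcome}} into one report row per test."""
    tests = sorted({t for per_tool in results.values() for t in per_tool})
    tools = sorted(results)
    rows = [[test] + [results[tool].get(test, "-") for tool in tools]
            for test in tests]
    return tools, rows

tools, rows = aggregate(results)
print("\t".join(["test"] + tools))
for row in rows:
    print("\t".join(row))
```

Because the report is regenerated from whatever results are published, a new tool's column appears as soon as its developer puts data on the web, which is exactly the "come out of the woodwork" effect described above.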

The GRDDL tests are coming along nicely; Chime's message on implementation and testing shows that the spec is quite straightforward to implement, and he updated the test harness so that we should be able to support Evaluation and Report Language (EARL) soon.

SPARQL looks a bit more challenging, but I hope we start to get some solid reports from developers about the SPARQL test collection soon too.

tags: QA, GRDDL, SPARQL, OWL, RDF, Semantic Web

Blogging is great

Submitted by timbl on Fri, 2006-11-03 10:11. ::

People have, since it started, complained about the fact that there is junk on the web. And as a universal medium, of course, it is important that the web itself doesn't try to decide what is publishable. The way quality works on the web is through links.

It works because reputable writers make links to things they consider reputable sources. So readers, when they find something distasteful or unreliable, don't just hit the back button once, they hit it twice. They remember not to follow links again through the page which took them there. One's chosen starting page, and a nurtured set of bookmarks, are the entrance points, then, to a selected subweb of information which one is generally inclined to trust and find valuable.

A great example of course is the blogging world. Blogs provide a gently evolving network of pointers of interest. As do FOAF files. I've always thought that FOAF could be extended to provide a trust infrastructure for (e.g.) spam filtering and OpenID-style single sign-on, and it's good to see things happening in that space.

In a recent interview with the Guardian, alas, my attempt to explain this was turned upside down into a "blogging is one of the biggest perils" message. Sigh. I think they took their lead from an unfortunate BBC article, which for some reason stressed concerns about the web rather than excitement, failure modes rather than opportunities. (This happens because when you launch a Web Science Research Initiative, people ask what the opportunities are and what the dangers are for the future. And some editors are tempted to just edit out the opportunities and headline the fears to get the eyeballs, which is old and boring newspaper practice. We expect better from the Guardian and BBC, generally very reputable sources.)

In fact, it is a really positive time for the web. Startups are launching, and being sold [Disclaimer: people I know] again, academics are excited about new systems and ideas, conferences and camps and wikis and chat channels are hopping with energy, and every morning demands an excruciating choice of which exciting link to follow first.

And, fortunately, we have blogs. We can publish what we actually think, even when misreported.

Reinventing HTML

Submitted by timbl on Fri, 2006-10-27 16:14. ::

Making standards is hard work. It's hard because it involves listening to other people and figuring out what they mean, which means figuring out where they are coming from, how they are using words, and so on.

There is the age-old tradeoff for any group as to whether to zoom along happily, in relative isolation, putting off the day when they ask for reviews, or whether to get lots of people involved early on, so a wider community gets on board earlier, with all the time that costs. That's a trade-off which won't go away.

The solutions tend to be different for each case, each working group. Some have lots of reviewers and some few, some have lots of time, some urgent deadlines.

A particular case is HTML. HTML has the potential interest of millions of people: anyone who has designed a web page may have useful views on new HTML features. It is the earliest spec of W3C, a battleground of the browser wars, and now the most widespread spec.

The perceived accountability of the HTML group has been an issue. Sometimes this was a departure from the W3C process, sometimes a sticking to it in principle, but not actually providing assurances to commenters. An issue was the formation of the breakaway WHAT WG, which attracted reviewers though it did not have a process or specific accountability measures itself.

There has been discussion in blogs where Daniel Glazman, Björn Höhrmann, Molly Holzschlag, Eric Meyer, Jeffrey Zeldman and others have shared concerns about W3C's work, particularly in the HTML area. The validator and other subjects cropped up too, but let's focus on HTML now. We had a W3C retreat in which we discussed what to do about these things.

Some things are very clear. It is really important to have real developers on the ground involved with the development of HTML. It is also really important to have browser makers intimately involved and committed. And also all the other stakeholders, including users and user companies and makers of related products.

Some things are clearer with hindsight of several years. It is necessary to evolve HTML incrementally. The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn't work. The large HTML-generating public did not move, largely because the browsers didn't complain. Some large communities did shift and are enjoying the fruits of well-formed systems, but not all. It is important to maintain HTML incrementally, as well as continuing a transition to a well-formed world, and developing more power in that world.

The plan is to charter a completely new HTML group. Unlike the previous one, this one will be chartered to do incremental improvements to HTML and, in parallel, to xHTML. It will have a different chair and staff contact. It will work on HTML and xHTML together. We have strong support for this group, from many people we have talked to, including browser makers.

There will also be work on forms. This is a complex area, as existing HTML forms and XForms are both form languages. HTML forms are ubiquitously deployed, and there are many implementations and users of XForms. Meanwhile, the Webforms submission has suggested sensible extensions to HTML forms. The plan is, informed by Webforms, to extend HTML forms. At the same time, there is a work item to look at how HTML forms (existing and extended) can be thought of as XForm equivalents, to allow an easy escalation path. A goal would be to have an HTML forms language which is a superset of the existing HTML language, and a subset of an XForms language with added HTML compatibility. We will see to what extent this is possible. There will be a new Forms group, and a common task force between it and the HTML group.

There is also a plan for a separate group to work on the XHTML2 work which the old "HTML working group" was working on. There will be no dependency of HTML work on the XHTML2 work.

As well as the new HTML work, there are other things I want to change. The validator, I think, is a really valuable tool both for users and in helping standards deployment. I'd like it to check (even) more stuff, be (even) more helpful, and prioritize carefully its errors, warnings, and mild chidings. I'd like it to link to explanations of why things should be a certain way. We have, by the way, just ordered some new server hardware, paid for by the Supporters program -- thank you!

This is going to be hard work. I'd like everyone to go into this realizing this. I'll be asking these groups to be very accountable, to have powerful issue tracking systems on the web site, and to be responsive in spirit as well as in letter to public comments. As always, we will be insisting on working implementations and test suites. Now we are going to be asking for things like talking with validator developers, maybe providing validator modules and validator test suites. (That's like a language test suite but backwards, in a way). I'm going to ask commenters to be respectful of the groups, as always. Try to check whether the comment has been made before, suggest alternative text, one item per message, etc., and add social awareness to technical perception.

This is going to be a very major collaboration on a very important spec, one of the crown jewels of web technology. Even though hundreds of people will be involved, we are evolving the technology which millions going on billions will use in the future. There won't seem like enough thankyous to go around some days. But we will be maintaining something very important and creating something even better.

Tim BL

p.s. comments are disabled here in breadcrumbs, the DIG research blog, but they are welcome in the W3C QA weblog.

Now is a good time to try the tabulator

Submitted by connolly on Thu, 2006-10-26 11:40. ::

Tim presented the tabulator to the W3C team today; see slides: Tabulator: AJAX Generic RDF browser.

The tabulator was sorta all over the floor when I tried to present it in Austin in September, but David Sheets put it back together in the last couple weeks. Yay David!

In particular, the support for viewing the HTTP data that you pick up by tabulating is working better than ever before. The HTTP vocabulary it uses has its own URIs; that seems like an interesting contribution to the WAI ER work on HTTP Vocabulary in RDF.

Note comments are disabled here in breadcrumbs until we figure out OpenID comment policies and drupal, etc. The tabulator issue tracker is probably a better place to report problems anyway. We don't have OpenID working there yet either, unfortunately, but we do support email callback based account setup.

Adding Shoenfield, Brachman books to my bookshelf?

Submitted by connolly on Mon, 2006-10-16 13:08. ::

In discussion following my presentation to the ACL2 seminar, Bob Boyer said "Shoenfield is the bible." And while talking with Elaine Rich before my Dean's Scholars lunch presentation, she strongly recommended Brachman's KR book.

This brought Jon Udell's library lookup gizmo out of the someday pile for a try. Support for the libraries near me didn't work out of the box. It's reasonably clear how to get it working, but a manual search showed that the libraries don't have the books anyway.

In the research library symposium I learned that even some on-campus researchers find it easier to buy books thru Amazon than to use their campus library. I have no trouble adding these two books to my amazon wishlist, but I hesitate to actually order them.

I make time for novels by McMurtry and Crichton and such, but the computer book industry is full of stuff that is rushed to market. If there's a topic that I'm really interested in, I can download the source the day it's released or follow the mailing list archives and weblogs pretty directly. By the time it has come out in book form, the chance to influence the direction has probably passed. So when I consider having my employer buy a book for my bookshelf, I have a hard time justifying more than $20 or so.

I devote one shelf to books that were part of my education... starting with The Complete Book of Model Car Building and A Field Guide to Rocks and Minerals thru The Facts for the Color Computer, OS/9, GoedelEscherBach, SmallTalk-80 and so on thru POSIX 1003.1. You might try the MyLowestBookshelf exercise. I had some fun with it a few years ago.

I do keep one shelf of Web and XML and networking books; not so much so that I can refer to their contents but rather to commemorate working with the people who wrote them.

I have Lessig's Free Culture on the top shelf, i.e. the "you really should have read this by now" guilt-pile. But I had better luck making time to listen to the recording of his Wikimania talk over the web.

For current events and new developments, I'll probably stick with online sources. But I think I'll order these two books; they seem to have stuff I can't get online.

postscript: for an extra $13.99, amazon offers to let me start reading the Brachman right now with their Amazon Online Reader. Hmm...

Wishing for XOXO microformat support in OmniOutliner

Submitted by connolly on Sat, 2006-09-16 21:46. ::

OmniOutliner is great; I used it to put together the panel remarks I mentioned in an earlier breadcrumbs item.

The XHTML export support is quite good, and even their native format is XML. But it's a bit of a pain to remember to export each time I save so that I can keep the outline version and the XHTML version in sync.

Why not just use an XHTML dialect like XOXO as the native format? I wonder what, if anything, their native format records that XOXO doesn't cover and couldn't be squeezed into XHTML's spare bandwidth somewhere.

There's also the question of whether XOXO can express anything that OmniOutliner can't handle, I suppose.
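For a sense of what the XHTML-native format would look like, here is a sketch that renders a nested outline as XOXO (an ordered-list microformat). The outline content is invented for illustration:

```python
# Sketch: serialize a nested (title, children) outline as XOXO markup,
# i.e. plain XHTML ordered lists with class="xoxo" on the root.
from xml.sax.saxutils import escape

def to_xoxo(items, cls="xoxo"):
    """Render a list of (title, children) pairs as an XOXO <ol>."""
    attr = ' class="%s"' % cls if cls else ""
    out = ["<ol%s>" % attr]
    for title, children in items:
        li = "<li>%s" % escape(title)
        if children:
            li += to_xoxo(children, cls=None)  # nested lists carry no class
        out.append(li + "</li>")
    out.append("</ol>")
    return "".join(out)

outline = [("Panel remarks", [("Point one", []), ("Point two", [])])]
print(to_xoxo(outline))
```

Since the output is ordinary XHTML, an outline stored this way would render in a browser directly, sidestepping the export-on-every-save problem.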

Blogged with Flock
