The details of data in documents; GRDDL, profiles, and HTML5

Submitted by connolly on Fri, 2008-08-22 14:09.
GRDDL, a mechanism for putting RDF data in XML/XHTML documents, is specified mostly at the XPath data model level. Some GRDDL software goes beyond XML and supports HTML as she are spoke, aka tag soup. HTML 5 is intended to standardize the connection between tag soup and XPath. The tidy use case for GRDDL anticipates that using HTML 5 concrete syntax rather than XHTML 1.x concrete syntax involves no changes at the XPath level.

But in GRDDL and HTML5, Ian Hickson, editor of HTML 5, advocates dropping the profile attribute of the HTML head element in favor of rel="profile" or some such. I dropped by the #microformats channel to think out loud about this stuff, and Tantek said similarly, "we may solve this with rel="profile" anyway." The rel-profile topic in the microformats wiki shows the idea goes pretty far back.

Possibilities I see include:
  • GRDDL implementors add support for rel="profile" along with HTML 5 concrete syntax
  • GRDDL implementors don't change their code, so people who want to use GRDDL with HTML 5 features such as <video> stick to XML-wf-happy HTML 5 syntax and they use the head/@profile attribute anyway, despite what the HTML 5 spec says.
  • People who want to use GRDDL stick to XHTML 1.x.
  • People who want to put data in their HTML documents use RDFa.
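
To make the first possibility concrete, here is a minimal sketch of how a GRDDL-aware tool might collect profile URIs from either spelling. It assumes XML-well-formed input in the XHTML namespace and uses only Python's standard library; the function name profile_uris is made up for illustration, and rel="profile" is the design floated above, not something any spec has settled.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def profile_uris(doc_text):
    """Collect profile URIs from head/@profile and from rel="profile" links.

    Hypothetical helper: assumes well-formed XHTML in the XHTML namespace.
    """
    root = ET.fromstring(doc_text)
    uris = []
    head = root.find(XHTML + "head")
    if head is None:
        return uris
    # XHTML 1.x style: a space-separated list of URIs in head/@profile
    uris.extend(head.get("profile", "").split())
    # rel="profile" style: rel is itself a space-separated list of tokens
    for link in head.iter(XHTML + "link"):
        rels = link.get("rel", "").lower().split()
        if "profile" in rels and link.get("href"):
            uris.append(link.get("href"))
    return uris


OLD = """<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.w3.org/2003/g/data-view"><title>t</title></head>
<body/></html>"""

NEW = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title>
<link rel="profile" href="http://www.w3.org/2003/g/data-view"/></head>
<body/></html>"""

print(profile_uris(OLD))
print(profile_uris(NEW))
```

Both documents yield the same list, so everything downstream of profile discovery could stay unchanged.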

I don't particularly care for the rel="profile" design, but one should choose one's battles and I'm not inclined to choose this one. I'm content for the market to choose.


A new Basketball season brings a new episode in the personal information disaster

Submitted by connolly on Thu, 2006-11-16 12:39.

Basketball season is here. Time to copy my son's schedule to my PDA. The organization that runs the league has their schedules online. (yay!) in HTML. (yay!). But with events separated by all <br>s rather than enclosed in elements. (whimper). Even after running it thru tidy, it looks like:

<br />
<b>Event Date:</b> Wednesday, 11/15/2006<br>
<b>Start Time:</b> 8:15<br />
<b>End Time:</b> 9:30<br />
<br />
<b>Event Date:</b> Wednesday, 11/8/2006<br />
<b>Start Time:</b> 8:15<br />

So much for XSLT. Time for a nasty perl hack.

Or maybe not. Between my new year's resolution (no more undocumented, untested code) and the maturity of the python libraries, my usual doctest-driven development worked fine; I was able to generate JSON-shaped structures without hitting that "oh screw it; I'll just use perl" point. The gist of the code is:

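import re
import sys

import kid  # the Kid template library


def main(argv):
    dataf, tplf = argv[1], argv[2]
    tpl = kid.Template(file=tplf)
    tpl.events = eachEvent(open(dataf))

    # write the serialized template out chunk by chunk
    for s in tpl.generate(output='xml', encoding='utf-8'):
        sys.stdout.write(s)


def eachEvent(lines):
    """turn an iterator over lines into an iterator over events

    findName and findDate are helpers not shown in this gist.
    """
    e = None
    for l in lines:
        if 'Last Name' in l:
            surname = findName(l)
            e = mkobj("practice", "Practice w/%s" % surname)
        elif 'Event Date' in l:
            if 'dtstart' in e:
                # a new date starts a new event; emit the finished one
                yield e
                e = mkobj("practice", "Practice w/%s" % surname)
            e['date'] = findDate(l)
        elif 'Start Time' in l:
            e['dtstart'] = e['date'] + "T" + findTime(l)
        elif 'End Time' in l:
            e['dtend'] = e['date'] + "T" + findTime(l)
    # emit the last event, which no later 'Event Date' line flushes
    if e is not None and 'dtstart' in e:
        yield e


next = 0
def mkobj(pfx, summary):
    """make an event record with a fresh id"""
    global next
    next += 1
    return {'id': "%s%d" % (pfx, next),
            'summary': summary}


def findTime(s):
    """
    >>> findTime("<b>Start Time:</b> 8:15<br />")
    '20:15:00'
    >>> findTime("<b>End Time:</b> 9:30<br />")
    '21:30:00'
    """
    m = re.search(r"(\d+):(\d+)", s)
    hh, mm = int(m.group(1)), int(m.group(2))
    # the schedule gives no am/pm; adding 12 treats every time as p.m.
    return "%02d:%02d:00" % (hh + 12, mm)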
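
The gist calls findDate without showing it. One way such a helper might look, in the same doctest style, is sketched here; the name comes from the code above, but this body is a guess rather than the original, and it assumes US-style month/day/year dates like the ones in the schedule.

```python
import re


def findDate(s):
    """Pull an ISO 8601 date out of an 'Event Date' line.

    Hypothetical reconstruction; assumes m/d/yyyy order.

    >>> findDate("<b>Event Date:</b> Wednesday, 11/15/2006<br />")
    '2006-11-15'
    >>> findDate("<b>Event Date:</b> Wednesday, 11/8/2006<br />")
    '2006-11-08'
    """
    m = re.search(r"(\d+)/(\d+)/(\d{4})", s)
    mm, dd, yyyy = int(m.group(1)), int(m.group(2)), int(m.group(3))
    return "%04d-%02d-%02d" % (yyyy, mm, dd)
```

Running the file through python -m doctest checks both examples. Zero-padding matters here because the result is concatenated with "T" and a time to form dtstart and dtend.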


It uses my palmagent hackery: event-rdf.kid to produce RDF/XML which hipAgent.py can upload to my PDA. I also used the event.kid template to generate an hCalendar/XHTML version for archival purposes, though I didn't use that directly to feed my PDA.

The development took half an hour or so squeezed into this morning:

changeset:   5:7d455f25b0cc
user:        Dan Connolly http://www.w3.org/People/Connolly/
date:        Thu Nov 16 11:31:07 2006 -0600
summary:     id, seconds in time, etc.

changeset:   2:2b38765cec0f
user:        Dan Connolly http://www.w3.org/People/Connolly/
date:        Thu Nov 16 09:23:15 2006 -0600
summary:     finds date, dtstart, dtend, and location of each event

changeset:   1:e208314f21b2
user:        Dan Connolly http://www.w3.org/People/Connolly/
date:        Thu Nov 16 09:08:01 2006 -0600
summary:     finds dates of each event

Wishing for XOXO microformat support in OmniOutliner

Submitted by connolly on Sat, 2006-09-16 21:46.

OmniOutliner is great; I used it to put together the panel remarks I mentioned in an earlier breadcrumbs item.

The XHTML export support is quite good, and even their native format is XML. But it's a bit of a pain to remember to export each time I save so that I can keep the ooutline version and the XHTML version in sync.

Why not just use an XHTML dialect like XOXO as the native format? I wonder what, if anything, their native format records that XOXO doesn't cover and couldn't be squeezed into XHTML's spare bandwidth somewhere.

There's also the question of whether XOXO can express anything that OmniOutliner can't handle, I suppose.


converting vcard .vcf syntax to hcard and catching up on CALSIFY

Submitted by connolly on Thu, 2006-06-29 00:17.

A while back I wrote about using JSON and templates to produce microformat data. I swapped some of those ideas in today while trying to figure out a simple, consistent model for recurring events using floating times plus locations.

I spent a little time catching up on the IETF CALSIFY WG; they meet Wednesday, July 12 at 9am in Montreal. I wonder how much progress they'll make on issues like the March 2007 DST change and the CalConnect recommendations toward an IANA timezone registry.

When I realized I didn't have a clear problem or use case in mind, I went looking for something that I could chew on in test-driven style.

So I picked up the hcard tests and built a vcard-to-hcard converter sort of out of spare parts. icslex.py handles low-level lexical details of iCalendar, which turn out to have lots in common with vCard: line breaking, escaping, that sort of thing. On top of that, I wrote vcardin.py, which has enough vcard schema smarts to break down the structured N and ADR and GEO properties so there's no microparsing below the JSON level. Then contacts.kid is a kid template that spits out the data following the hcard spec.
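
The source of vcardin.py isn't shown here, so the following is only a sketch of the idea: split a structured vCard value such as N or ADR on unescaped semicolons and hand the template a record keyed by hCard class names. The helper names (split_unescaped, unescape, structured) are invented for the example; the field order follows the vCard definitions of N and ADR.

```python
import re

N_FIELDS = ["family-name", "given-name", "additional-name",
            "honorific-prefix", "honorific-suffix"]
ADR_FIELDS = ["post-office-box", "extended-address", "street-address",
              "locality", "region", "postal-code", "country-name"]


def split_unescaped(value, sep):
    """Split value on sep, skipping separators escaped with a backslash."""
    parts, cur, i = [], [], 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            cur.append(value[i:i + 2])  # keep the escape pair intact for now
            i += 2
        elif ch == sep:
            parts.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(ch)
            i += 1
    parts.append("".join(cur))
    return parts


def unescape(text):
    """Undo vCard backslash escapes; an escaped n becomes a newline."""
    return re.sub(r"\\(.)",
                  lambda m: "\n" if m.group(1) in "nN" else m.group(1),
                  text)


def structured(value, fields):
    """Map a structured property value onto named fields, dropping empty ones."""
    rec = {}
    for name, part in zip(fields, split_unescaped(value, ";")):
        if part:
            rec[name] = unescape(part)
    return rec


print(structured("Public;John;Quinlan;Mr.;Esq.", N_FIELDS))
print(structured(r";;123 Main St\, Suite 4;Any Town;CA;91921-1234;USA", ADR_FIELDS))
```

With records shaped like this, a template only has to test for a key and print its value; nothing below the JSON level needs parsing. This sketch keeps each component as a single string and does not split comma-separated multiple values.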

It works like this:

python vcardin.py contacts.kid hcard/01-tantek-basic.vcf >,01.html

Then I used X2V to convert the results back to .vcf format, compared them using hcard testing tools (normalize.pl and such), and fixed the breakage. Lather, rinse, repeat... I have pretty much all the tests working except 16-honorific-additional-multiple.

It really is a pain to set up a loop for the additional-name field when that field is almost never used, let alone used with multiple values. This sort of structure is more natural in XML/XHTML/hCard than in JSON, in a way. And if I change the JSON structure from a string to a list, does that mean the RDF property should use a list/collection? Probably not... I probably don't need additional-name to be an owl:FunctionalProperty.

Hmm... meanwhile, this contacts.kid template should mesh nicely with my wikipedia airport scraping hack...

See also: IRC notes from #microformats, from #swig.

citing W3C specs from WWW conference papers

Submitted by connolly on Tue, 2006-04-25 10:19.

As I said in a July 2000 message to www-rdf-interest:

There are very few data formats I trust... when I use the computer to capture my knowledge, I pretty much stick to plain text, XML (esp XHTML, or at least HTML that tidy can turn into XHTML for me), RCS/CVS, and RFC822/MIME. I use JPG, PNG, and PDF if I must, but not for capturing knowledge for exchange, revision, etc.

And as I explained in a 1994 essay, converting from LaTeX is hard, so I try not to write in LaTeX either.

The Web conference has instructions for submitting PDF using LaTeX or MS Word and (finally!) for submitting XHTML. (The WWW2006 paper CSS stylesheet is horrible... who wants to read 9pt times on screen?!?! Anyway...) So when the IRW 2006 organizers told me they'd like a PDF version of my paper in that style, I dusted off my Transforming XHTML to LaTeX and BibTeX tools and got to work.

My paper cites a number of W3C specs, including HTML 4. The W3C tech reports index/digital library has an associated bibliography generator. I fed it http://www.w3.org/TR/html401 and it generated a nice bibliographic reference from an RDF data set. I'm interested in the ongoing citation microformats work that might make that transformation lossless, since I need not just XHTML, but BibTeX. What I'm doing currently is adding some BibTeX vocabulary in class and rel attributes:

<dt class="TechReport">
<a name="HTML4" id="HTML4">[HTML4]</a>
</dt>

<dd><span class="author">Le Hors, Arnaud and Raggett, Dave and
Jacobs, Ian</span> Editors,
<cite> <a
href="http://www.w3.org/TR/1999/REC-html401-19991224">HTML 4.01
Specification</a> </cite>,
<span class="institution">W3C</span> Recommendation,
24 <span class="month">December</span> <span class="year">1999</span>,
<tt class="number">http://www.w3.org/TR/1999/REC-html401-19991224</tt>.
<a href="http://www.w3.org/TR/html401" title="Latest version of
HTML 4.01 Specification">Latest version</a> available at
http://www.w3.org/TR/html401 .</dd>

When run thru my xh2bib.xsl, out comes:

@TechReport{HTML4,
    title = "{HTML 4.01 Specification}",
    author = {Le Hors, Arnaud
and Raggett, Dave
and Jacobs, Ian},
    institution = {W3C},
    month = {December},
    year = {1999},
    number = {http://www.w3.org/TR/1999/REC-html401-19991224},
    howpublished = { \url{http://www.w3.org/TR/1999/REC-html401-19991224} }
}
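
The real transformation is done by the XSLT stylesheet xh2bib.xsl. As a rough analogue in Python, and not a port of that stylesheet, the sketch below shows how class attributes can drive the mapping. It assumes a well-formed dl with dt/dd pairs and no XML namespace, handles only the fields listed in FIELDS, skips howpublished, and does no BibTeX escaping; SAMPLE, text_of and to_bibtex are names made up for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dl>
<dt class="TechReport"><a name="HTML4" id="HTML4">[HTML4]</a></dt>
<dd><span class="author">Le Hors, Arnaud and Raggett, Dave and
Jacobs, Ian</span> Editors,
<cite><a href="http://www.w3.org/TR/1999/REC-html401-19991224">HTML 4.01
Specification</a></cite>,
<span class="institution">W3C</span> Recommendation,
24 <span class="month">December</span> <span class="year">1999</span>,
<tt class="number">http://www.w3.org/TR/1999/REC-html401-19991224</tt>.</dd>
</dl>"""

FIELDS = ("author", "institution", "month", "year", "number")


def text_of(elt):
    """All text under elt, with runs of whitespace collapsed."""
    return " ".join("".join(elt.itertext()).split())


def to_bibtex(dl_text):
    """Turn each dt/dd pair into one BibTeX entry (sketch, no escaping)."""
    dl = ET.fromstring(dl_text)
    entries = []
    for dt, dd in zip(dl.findall("dt"), dl.findall("dd")):
        # entry type comes from the dt class, citation key from the anchor id
        lines = ["@%s{%s," % (dt.get("class"), dt.find("a").get("id"))]
        lines.append('    title = "{%s}",' % text_of(dd.find("cite")))
        for elt in dd.iter():
            if elt.get("class") in FIELDS:
                lines.append("    %s = {%s}," % (elt.get("class"), text_of(elt)))
        lines.append("}")
        entries.append("\n".join(lines))
    return "\n\n".join(entries)


print(to_bibtex(SAMPLE))
```

The point of the design is that the class names double as BibTeX field names, so the converter needs no per-field logic beyond a whitelist.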

I think I should be using editor = rather than author = but that didn't work the 1st time I tried and I haven't investigated further.

In any case, I'm reasonably happy with the PDF output.

using JSON and templates to produce microformat data

Submitted by connolly on Sun, 2006-03-19 04:09.

In Getting my Personal Finance data back with hCalendar and hCard, I discussed using JSON-style records as an intermediate structure between tab-separated transaction report data and hCalendar. I just took it a step further in palmagent; hipsrv.py uses kid templates, so the markup can be tweaked independently of the normalization and SPARQL-like filtering logic. I expect to be able to do RDF/XML output thru templates too.

Working at the JSON level is nice; when I want to make a list of 3 numbers, I can just do that, unlike in XML where I have to make up names and think about whether to use a space-separated microparsed attribute value or a massively redundant element structure.
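
A small illustration of that contrast, using only the Python standard library. The element and attribute names (point, coords, coord) are invented for the example, which is rather the point: XML makes you choose them.

```python
import json
import xml.etree.ElementTree as ET

nums = [3, 1, 4]

# JSON: the list is just a list
as_json = json.dumps(nums)

# XML option 1: a space-separated attribute value that readers must microparse
compact = ET.Element("point", coords=" ".join(str(n) for n in nums))

# XML option 2: one element per item, with a made-up name repeated each time
verbose = ET.Element("point")
for n in nums:
    ET.SubElement(verbose, "coord").text = str(n)

print(as_json)
print(ET.tostring(compact, encoding="unicode"))
print(ET.tostring(verbose, encoding="unicode"))
```

Reading the JSON back with json.loads gives the list directly, while the first XML form needs a split() and the second needs a loop over child elements.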

It brings me back to my March 1997 essay for the Web Apps issue on Distributed Objects and noodling on VHLL types in ILU.

A look at emerging Web security architectures from a Semantic Web perspective

Submitted by connolly on Fri, 2006-03-17 17:51.

W3C had a workshop, Toward a more Secure Web, this week. Citigroup hosted; the view from the 50th floor was awesome.

Some notes on the workshop are taking shape:

A look at emerging Web security architectures from a Semantic Web perspective

Comparing OpenID, SXIP/DIX, InfoCard, SAML to RDF, GRDDL, FOAF, P3P, XFN and hCard

At the W3C security workshop this week, I finally got to study SXIP in some detail after hearing about it and wondering how it compares to OpenID, Yadis, and the other "Identity 2.0" techniques brewing. And just in time, with a DIX/SXIP BOF at the Dallas IETF next week.

Reflections on the W3C Technical Plenary week

Submitted by connolly on Tue, 2006-03-07 20:31.

Here comes (some of) the TAG
Originally uploaded by Norm Walsh.

The last item on the agenda of the TAG meeting in France was "Reviewing what we have learned during a full week of meetings". I proposed that we do it on the beach, and it carried.

By then, the network woes of Monday and Tuesday had largely faded from memory.

I was on two of the plenary day panels. Tantek reports on one of them: Microformats voted best session at W3C Technical Plenary Day!. My presentation in that panel was on GRDDL and microformats. Jim Melton followed with his SPARQL/SQL/XQuery talk. Between the two of them, Noah Mendelsohn said he thought the Semantic Web might just be turning a corner.

My other panel presentation was Feedback loops and formal systems, where I talked about UML and OWL after touching on the contrast between symbolic approaches like the Semantic Web and statistical approaches like pagerank. Folksonomies are an interesting mixture of both, I suppose. Alistair took me to task for being sloppy with the term "chaotic system"; he's quite right that complex system is the more appropriate description of the Web.

The TAG discussion of that session started with jokes about how formal systems is soporific enough without putting it right after a big French lunch. TimBL mentioned the scheme denotational semantics, and TV said that Jonathan Rees is now at Creative Commons. News to me. I spent many, many hours poring over his scheme48 code a few years back. I don't think I knew where the name came from until today: Within 48 hours we had designed and implemented a working Scheme, including read, write, byte code compiler, byte code interpreter, garbage collector, and initialization logic.

The SemWeb IG meeting on Thursday was full of fun lightning talks and cool hacks. I led a GRDDL discussion that went well, I think. The SPARQL calendar demo rocked. Great last-minute coding, Lee and Elias!

There and back again

On the return leg of my itinerary, the captain announced the cruising altitude, as usual, and then added ... which means you'll spend most of today 6 miles above the earth.

My travel checklist worked pretty well, with a few exceptions. The postcard thing isn't a habit yet. I forgot a paperback book; that was OK since I slept quite a bit on the way over and I got into the coding zone on the way back; more about that later, I hope.

Other Reflections

See also reflections by:

... and stay tuned for something from

See also: Flickr photo group, NCE bookmarks

XHTML for computer science research papers and bibliographies

Submitted by connolly on Thu, 2005-11-03 15:34.

In a PAW project discussion of writing assignments for WWW2006, KR, etc., I asked that we use XHTML rather than LaTeX to collaborate on the papers.

The WWW2006 deadline is too soon to make the transition, but I took the source of one of the papers in development and translated it to XHTML in order to test my Transforming XHTML to LaTeX and BibTeX tools. Since the tools have only been tested on one project, of course they needed some tweaks. And they'll need some more for figures.

But I'm hopeful that it'll be cost-effective to do things this way.

Meanwhile, there's a cite-formats discussion in the microformat community. My work includes a microformat for bibliography stuff. I haven't figured out URIs for the properties nor converted it to RDF just yet, like I did for my old index of URI schemes and like we did for automating publication of W3C tech reports.
