microformats

The details of data in documents; GRDDL, profiles, and HTML5

Submitted by connolly on Fri, 2008-08-22 14:09.
GRDDL, a mechanism for putting RDF data in XML/XHTML documents, is specified mostly at the XPath data model level. Some GRDDL software goes beyond XML and supports HTML as she are spoke, aka tag soup. HTML 5 is intended to standardize the connection between tag soup and XPath. The tidy use case for GRDDL anticipates that using HTML 5 concrete syntax rather than XHTML 1.x concrete syntax involves no changes at the XPath level.

But in GRDDL and HTML5, Ian Hickson, editor of HTML 5, advocates dropping the profile attribute of the HTML head element in favor of rel="profile" or some such. I dropped by the #microformats channel to think out loud about this stuff, and Tantek said similarly, "we may solve this with rel="profile" anyway." The rel-profile topic in the microformats wiki shows the idea goes pretty far back.

Possibilities I see include:
  • GRDDL implementors add support for rel="profile" along with HTML 5 concrete syntax
  • GRDDL implementors don't change their code, so people who want to use GRDDL with HTML 5 features such as <video> stick to XML-wf-happy HTML 5 syntax and they use the head/@profile attribute anyway, despite what the HTML 5 spec says.
  • People who want to use GRDDL stick to XHTML 1.x.
  • People who want to put data in their HTML documents use RDFa.
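
For the first possibility, profile discovery would have to look in two places. What follows is a minimal sketch, assuming Python's standard html.parser as the tag-soup reader; real GRDDL implementations work at the XPath level and may well do this differently. ProfileFinder and find_profiles are made-up names, not part of any GRDDL library.

```python
from html.parser import HTMLParser


class ProfileFinder(HTMLParser):
    """Collect profile URIs from head/@profile and from rel="profile" links."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head" and attrs.get("profile"):
            # HTML 4 / XHTML 1.x style: a space-separated list of URIs
            self.profiles.extend(attrs["profile"].split())
        elif tag in ("link", "a") and attrs.get("href"):
            # the proposed alternative: rel="profile" on a link
            rels = (attrs.get("rel") or "").lower().split()
            if "profile" in rels:
                self.profiles.append(attrs["href"])


def find_profiles(markup):
    """Return every profile URI found, in document order."""
    finder = ProfileFinder()
    finder.feed(markup)
    return finder.profiles
```

A document that uses both conventions yields both URIs, so an implementation along these lines would not have to choose between them.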

I don't particularly care for the rel="profile" design, but one should choose one's battles and I'm not inclined to choose this one. I'm content for the market to choose.

 

sidekick calendar subscription for SXSW

Submitted by connolly on Sat, 2008-03-08 12:57.

At a conference, like in a good coding session, it's too easy to lose track of time, so I rely heavily on a PDA to remind me of appointments. The SXSW program has just the features I want:

  • an "add this to my calendar" button next to each session
  • a calendar feed of my choices

But I carry a hiptop, which doesn't support calendar subscription. I could copy-and-paste a few critical sessions to my hiptop, but when the climbing geeks offer an hCalendar feed, it becomes worthwhile to use iCal on the laptop, i.e. something that groks calendar subscription, as the master calendar device.

I have had a system for exporting my mobile calendar as a feed, but it's a tedious 4 step shell command sequence; it's OK once or twice a week, but here at SXSW, I want to sync up several times a day.

I have been moving my palmagent project from shell commands and Makefiles to a RESTful Web service, and this pushed me over the edge to add calendar feed support.

As usual, to pull the data from the hiptop's data servers:

  1. Make a directory to hold hiptop accounts and put it in hip_config.py:
    AccountsDir = "/Users/connolly/Desktop/danger-accts"
  2. Start hipwsgi.py running:
    pbjam:~/projects/palmagent$ python hipwsgi.py &
    Serving HTTP on 0.0.0.0 port 8080 ...
  3. Use dangerSync.py to log in and get some session credentials for half an hour of use:
    ~/Desktop/danger-accts/ACCT $ python ~/projects/palmagent/dangerSync.py \
    --prod --user ACCT \
    --passwd YOUR_PASSWORD_HERE \
    >session-id
  4. Visit http://0.0.0.0:8080/pim/ACCT and hit the Pull button.

Now you have event, task, contact, and note directories containing a JSON file for each record and hipwsgi.py lets you navigate them in a few different ways.

The pull feature is incremental; it grabs just the records that have changed since you previously pulled:

Pull majo from danger hiptop service

back to sync options

event

 

The new feature today is the ical export, linked from the event categories page:

event

back to sync options

 

You can copy the address of that ical export link and subscribe to it from iCal, and bingo, there it is, merged with the SXSW calendar and such.
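
The export step is not shown in this post. As a rough sketch of the idea, assuming each record is a JSON file with id, summary, dtstart and dtend keys (hypothetical names, not the hiptop service's actual record layout) and a .json file extension (also an assumption), the conversion to an iCalendar feed could look like this. It skips line folding and DTSTAMP, which a strict validator would want.

```python
import json
from pathlib import Path


def ical_escape(text):
    """Escape a TEXT value: backslash, semicolon, comma, newline."""
    return (text.replace("\\", "\\\\").replace(";", "\\;")
                .replace(",", "\\,").replace("\n", "\\n"))


def to_vevent(rec):
    # Key names here are hypothetical, chosen for illustration only.
    return "\r\n".join([
        "BEGIN:VEVENT",
        "UID:" + rec["id"],
        "SUMMARY:" + ical_escape(rec["summary"]),
        "DTSTART:" + rec["dtstart"],
        "DTEND:" + rec["dtend"],
        "END:VEVENT",
    ])


def to_calendar(records):
    """Wrap the events in a VCALENDAR; iCalendar lines end in CRLF."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0",
             "PRODID:-//example//hiptop export sketch//EN"]
    lines += [to_vevent(r) for r in records]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


def load_records(event_dir):
    """Read one JSON file per record from a directory."""
    return [json.loads(p.read_text())
            for p in sorted(Path(event_dir).glob("*.json"))]
```

Serving the string returned by to_calendar(load_records(...)) with a text/calendar content type is what would give a calendar client something to subscribe to.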

@@screenshot pending 

 

hAudio for microformats mixtapes, in progress

Submitted by connolly on Thu, 2008-03-06 17:00.

I was visiting a friend and I wanted to play Back When I Could Fly and the easiest way was to burn a CD and put it in their CD player and while I was at it I figured I might as well pick a few other songs... a sort of mixtape to say thanks for letting me crash there.

That sort of artifact is too precious to leave locked up in iTunes's proprietary format, even if it is XML; as I said in a July 2000 message:

There are very few data formats I trust... when I use
the computer to capture my knowledge, I pretty
much stick to plain text, [X]HTML, and email. I use JPG, PNG, and PDF if I must,
but not for capturing knowledge for exchange, revision, etc.


So I wrote itunekb.py, which reads the iTunes data, picks out one playlist, and writes it out in hAudio format using a genshi template. The result is ordinary HTML at one level:

  1. Poems, Prayers And Promises by John Denver
    4:06 from A Song's Best Friend: The Very Best Of John Denver [Disc 1] (2004)
  2. Did You Feel The Mountains Tremble by Delirious?
    4:42 from WOW Worship: Orange (Disc 1) (2000)
  3. The Reason by Hoobastank
    3:52 from The Reason (2003)
  4. Back When I Could Fly by Trout Fishing In America
    3:29 from Family Music Party (1998)
  5. ...

At another level, it's yummy Semantic Web data.

Oops! Well, it used to be; but hAudio seems to be changing:

Here's hoping I find time to catch up.
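
For reference, the playlist-to-markup step might look roughly like the sketch below. It is not itunekb.py and does not use genshi; it uses plain string formatting, hypothetical track keys (title, artist, album, seconds, year), and class names (haudio, fn, contributor, duration, album, published) taken from an hAudio draft, which may not match the current state of that draft.

```python
from html import escape


def fmt_duration(seconds):
    """Render a length in seconds as m:ss."""
    return "%d:%02d" % divmod(seconds, 60)


def haudio_item(track):
    # Track keys and class names are assumptions; see the note above.
    return (
        '<li class="haudio">'
        '<span class="fn">%s</span> by '
        '<span class="contributor">%s</span> '
        '<span class="duration">%s</span> from '
        '<span class="album">%s</span> '
        '(<span class="published">%d</span>)'
        '</li>'
    ) % (escape(track["title"]), escape(track["artist"]),
         fmt_duration(track["seconds"]), escape(track["album"]),
         track["year"])


def haudio_list(tracks):
    """An ordered list, so it still reads as ordinary HTML."""
    return "<ol>\n%s\n</ol>" % "\n".join(haudio_item(t) for t in tracks)
```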

Soccer schedules, flight itineraries, timezones, and python web frameworks

Submitted by connolly on Wed, 2007-09-12 17:17.

The schedule for this fall soccer season came out August 11th. I got the itinerary for the trip I'm about to take on July 26. But I just now got them synchronized with the family calendar.

The soccer league publishes the schedule in somewhat reasonable HTML; to get that into my sidekick, I have a Makefile that does these steps:

  1. Use tidy to make the markup well-formed.
  2. Use 100 lines of XSLT (soccer-schedfix.xsl) to add hCalendar markup.
  3. Use glean-hcal.xsl to get RDF calendar data.
  4. Use hipAgent.py to upload the calendar items via XMLRPC to the danger/t-mobile service, which magically updates the sidekick device.

But oops! The timezones come out wrong. Ugh... manually fix the times of 12 soccer games... better than manually keying in all the data... then sync with the family calendar. My usual calendar sync Makefile does the following:

  1. Use dangerSync.py to download the calendar and task data via XMLRPC.
  2. Use hipsrv.py to filter by category=family, convert from danger/sidekick/hiptop conventions to iCalendar standard conventions, and pour the records into a kid template to produce RDF Calendar (and hCalendar).
  3. Use toIcal.py to convert RDF Calendar to .ics format.
  4. Upload to family WebDAV server using curl.

Then check the results on my mac to make sure that when my wife refreshes her iCal subscriptions it will look right.

Oh no! The timezones are wrong again!

The sidekick has no visible support for timezones, but the start_time and end_time fields in the XMLRPC interface are in Z/UTC time, and there's a timezone field. However, after years with this device, I'm still mystified about how it works. The Makefiles approach is not conducive to tinkering at this level, so I worked on my REST interface, hipwsgi.py until it had crude support for editing records (using JSON syntax in a form field). What I discovered is that once you post an event record with a mixed up timezone, there's no way to fix it. When you use the device UI to change the start time, it looks OK, but the Z time via XMLRPC is then wrong.

So I deleted all the soccer game records, carefully factored the danger/iCalendar conversion code out of hipAgent.py into calitems.py for ease of testing, and got it working for local Chicago-time events.

Then I went through the whole story again with my itinerary. Just replace tidy and soccer-schedfix.xsl with flightCal.py to get the itinerary from SABRE's text format to hCalendar:

  1. Upload itinerary to the sidekick.
  2. Manually fix the times.
  3. Sync with iCal. Bzzt. Off by several hours.
  4. Delete the flights from the sidekick.
  5. Work on calitems.py some more.
  6. Upload to the sidekick again. Ignore the sidekick display, which is right for the parts of the itinerary in Chicago, but wrong for the others.
  7. Sync with iCal. Win!

I suppose I'm resigned that the only way to get the XMLRPC POST/upload right (the stored Z times, at least, if not the display) is to know what timezone the device is set to when the POST occurs. Sigh.
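
The failure mode is easy to reproduce in miniature. This is a sketch, not the calitems.py code: fixed UTC offsets stand in for real timezone rules (a real converter would need a timezone database), and the departure time and destination zone are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as they applied in September 2007; names are just labels.
CDT = timezone(timedelta(hours=-5), "CDT")   # the zone the device is set to
PDT = timezone(timedelta(hours=-7), "PDT")   # hypothetical destination zone


def to_z(wall_clock, zone):
    """Read a naive wall-clock time as being in `zone`; return a UTC stamp."""
    utc = wall_clock.replace(tzinfo=zone).astimezone(timezone.utc)
    return utc.strftime("%Y%m%dT%H%M%SZ")


departure = datetime(2007, 9, 13, 8, 30)     # made-up local time of a flight

right = to_z(departure, PDT)   # interpreted in the event's own zone
wrong = to_z(departure, CDT)   # interpreted in the device's zone
print(right, wrong)
```

The two stamps differ by exactly the offset between the zones, which matches the "off by several hours" symptom: the same wall-clock digits become different Z times depending on which zone is assumed at upload.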

A March 2005 review corroborates my findings:

The Sidekick and the sync software do not seem to be aware of time zones. That means that your PC and your Sidekick have to be configured for the same time zone when they synchronize, else your appointments will be all wrong.

 

hipwsgi.py is about my 5th iteration on this idea of a web server interface to my PDA data. It uses WSGI and JSON and Genshi, following Joe G's stuff. Previous iterations include:

  1. pdkb.pl - quick n dirty perl hack (started April 2001)
  2. hipAgent.py - screen scraping (Dec 2002)
  3. dangerSync.py - XMLRPC with a python shelf and hardcoded RDF/XML output (Feb 2004)
  4. hipsrv.py - conversion logic in python with kid templates and SPARQL-like filters over JSON-shaped data (March 2006)
It's pretty raw right now, but fleshing out the details looks like fun. Wish me luck.
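
To give a feel for the WSGI-plus-JSON shape of such a server, here is a minimal sketch. It is not hipwsgi.py: make_app, the URL layout /ACCT/KIND/ID, and the one-file-per-record directory layout with a .json suffix are all assumptions made for illustration, and Genshi templating is left out.

```python
import json
from pathlib import Path


def make_app(accounts_dir):
    """Build a WSGI app serving /ACCT/KIND/ID from ACCT/KIND/ID.json."""
    root = Path(accounts_dir)

    def not_found(start_response):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"no such record\n"]

    def app(environ, start_response):
        parts = [p for p in environ.get("PATH_INFO", "").split("/") if p]
        # refuse anything that is not exactly three plain path segments
        if len(parts) != 3 or any(p.startswith(".") for p in parts):
            return not_found(start_response)
        acct, kind, rec_id = parts
        path = root / acct / kind / (rec_id + ".json")
        if not path.is_file():
            return not_found(start_response)
        # round-trip through json to be sure we only serve valid JSON
        record = json.loads(path.read_text())
        body = json.dumps(record, indent=2).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json"),
                                  ("Content-Length", str(len(body)))])
        return [body]

    return app
```

For local tinkering, the standard library's wsgiref.simple_server.make_server("127.0.0.1", 8080, make_app("some-directory")).serve_forever() is enough to put it on a port.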

A new Basketball season brings a new episode in the personal information disaster

Submitted by connolly on Thu, 2006-11-16 12:39.

Basketball season is here. Time to copy my son's schedule to my PDA. The organization that runs the league has their schedules online. (yay!) in HTML. (yay!). But with events separated by all <br>s rather than enclosed in elements. (whimper). Even after running it thru tidy, it looks like:

<br />
<b>Event Date:</b> Wednesday, 11/15/2006<br />
<b>Start Time:</b> 8:15<br />
<b>End Time:</b> 9:30<br />
...
<br />
<b>Event Date:</b> Wednesday, 11/8/2006<br />
<b>Start Time:</b> 8:15<br />

So much for XSLT. Time for a nasty perl hack.

Or maybe not. Between my "no more undocumented, untested code" new year's resolution and the maturity of the python libraries, my usual doctest-driven development worked fine; I was able to generate JSON-shaped structures without hitting that "oh screw it; I'll just use perl" point. The gist of the code is:

import re
import sys

import kid

def main(argv):
    dataf, tplf = argv[1], argv[2]
    tpl = kid.Template(file=tplf)
    tpl.events = eachEvent(open(dataf))

    for s in tpl.generate(output='xml', encoding='utf-8'):
        sys.stdout.write(s)

def eachEvent(lines):
    """turn an iterator over lines into an iterator over events
    """
    e = {}
    for l in lines:
        if 'Last Name' in l:
            # a new player: later events are practices with this surname
            surname = findName(l)
            e = mkobj("practice", "Practice w/%s" % surname)
        elif 'Event Date' in l:
            # a second date for the same player starts a new event
            if 'dtstart' in e:
                yield e
                e = mkobj("practice", "Practice w/%s" % surname)
            e['date'] = findDate(l)
        elif 'Start Time' in l:
            e['dtstart'] = e['date'] + "T" + findTime(l)
        elif 'End Time' in l:
            e['dtend'] = e['date'] + "T" + findTime(l)
    # the loop yields an event only when the next one starts; emit the last one too
    if 'dtstart' in e:
        yield e

next = 0
def mkobj(pfx, summary):
    global next
    next += 1
    return {'id': "%s%d" % (pfx, next),
            'summary': summary,
            }

def findTime(s):
    """
    >>> findTime("<b>Start Time:</b> 8:15<br />")
    '20:15:00'
    >>> findTime("<b>End Time:</b> 9:30<br />")
    '21:30:00'
    """
    m = re.search(r"(\d+):(\d+)", s)
    hh, mm = int(m.group(1)), int(m.group(2))
    return "%02d:%02d:00" % (hh + 12, mm)

...

It uses my palmagent hackery: event-rdf.kid to produce RDF/XML which hipAgent.py can upload to my PDA. I also used the event.kid template to generate an hCalendar/XHTML version for archival purposes, though I didn't use that directly to feed my PDA.
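
The findDate helper is among the parts elided above. A guess at what it needs to do, in the same doctest-driven style, is sketched here; the ISO-style output format and the month/day/year reading of the input are assumptions, and this is not the original function.

```python
import re


def findDate(s):
    """Pull a month/day/year date out of a line and return it as YYYY-MM-DD.

    >>> findDate("<b>Event Date:</b> Wednesday, 11/15/2006<br />")
    '2006-11-15'
    >>> findDate("<b>Event Date:</b> Wednesday, 11/8/2006<br />")
    '2006-11-08'
    """
    m = re.search(r"(\d{1,2})/(\d{1,2})/(\d{4})", s)
    mm, dd, yyyy = int(m.group(1)), int(m.group(2)), int(m.group(3))
    return "%04d-%02d-%02d" % (yyyy, mm, dd)
```

With that format, joining the date to the result of findTime with a "T" gives values such as 2006-11-15T20:15:00.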

The development took half an hour or so squeezed into this morning:

changeset:   5:7d455f25b0cc
user:        Dan Connolly http://www.w3.org/People/Connolly/
date:        Thu Nov 16 11:31:07 2006 -0600
summary:     id, seconds in time, etc.

changeset:   2:2b38765cec0f
user:        Dan Connolly http://www.w3.org/People/Connolly/
date:        Thu Nov 16 09:23:15 2006 -0600
summary:     finds date, dtstart, dtend, and location of each event

changeset:   1:e208314f21b2
user:        Dan Connolly http://www.w3.org/People/Connolly/
date:        Thu Nov 16 09:08:01 2006 -0600
summary:     finds dates of each event

Wishing for XOXO microformat support in OmniOutliner

Submitted by connolly on Sat, 2006-09-16 21:46.

OmniOutliner is great; I used it to put together the panel remarks I mentioned in an earlier breadcrumbs item.

The XHTML export support is quite good, and even their native format is XML. But it's a bit of a pain to remember to export each time I save so that I can keep the ooutline version and the XHTML version in sync.

Why not just use an XHTML dialect like XOXO as the native format? I wonder what, if anything, their native format records that XOXO doesn't cover and couldn't be squeezed into XHTML's spare bandwidth somewhere.

There's also the question of whether XOXO can express anything that OmniOutliner can't handle, I suppose.

Blogged with Flock

Talking with U.T. Austin students about the Microformats, Drug Discovery, the Tabulator, and the Semantic Web

Submitted by connolly on Sat, 2006-09-16 21:36.

Working with the MIT tabulator students has been such a blast that while I was at U.T. Austin for the research library symposium, I thought I would try to recruit some undergrads there to get into it. Bob Boyer invited me to speak to his PHL313K class on why the heck they should learn logic, and Alan Cline invited me to the Dean's Scholars lunch, which I used to attend when I was at U.T.

To motivate logic in the PHL313K class, I started with their experience with HTML and blogging and explained how the Semantic Web extends the web by looking at links as logical propositions. I used my XML 2005 slides to talk a little bit about web history and web architecture, and then I moved into using hCalendar (and GRDDL, though I left that largely implicit) to address the personal information disaster. This was the first week or so of class and they had just started learning propositional logic, and hadn't even gotten as far as predicate calculus where atomic formulas like those in RDF show up. And none of them had heard of microformats. I promised not to talk for the full hour but then lost track of time and didn't get to the punch line, "so the computer tells you that no, you can't go to both the conference and Mom's birthday party because you can't be in two places at once" until it was time for them to head off to their next class.
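
The punch line amounts to a small inference over calendar data. A toy sketch of it, with hypothetical record keys ("where", "when") and no RDF machinery at all:

```python
from datetime import datetime


def overlaps(a, b):
    """True when two (start, end) intervals share any instant."""
    return a[0] < b[1] and b[0] < a[1]


def conflicts(events):
    """Yield pairs of event names that overlap in time but differ in place."""
    items = sorted(events.items())
    for i, (name1, e1) in enumerate(items):
        for name2, e2 in items[i + 1:]:
            if e1["where"] != e2["where"] and overlaps(e1["when"], e2["when"]):
                yield name1, name2
```

Given a conference in one city and a birthday party in another during the same afternoon, conflicts reports that pair; two events in the same place, or at different times, pass without complaint.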

One student did stay after to pose a question that is very interesting and important, if only tangentially related to the Semantic Web: with technology advancing so fast, how do you maintain balance in life?

While Boyer said that talk went well, I think I didn't do a very good job of connecting with them; or maybe they just weren't really awake; it was an 8am class after all. At the Dean's Scholars lunch, on the other hand, the students were talking to each other so loudly as they grabbed their sandwiches that Cline had to really work to get the floor to introduce me as a "local boy done good." They responded with a rousing ovation.

Elaine Rich had provided the vital clue for connecting with this audience earlier in the week. She does AI research and had seen TimBL's AAAI talk. While she didn't exactly give the talk four stars overall, she did get enough out of it to realize it would make an interesting application to add to a book that she's writing, where she's trying to give practical examples that motivate automata theory. So after I took a look at what she had written about URIs and RDF and OWL and such, she reminded me that not all the Dean's Scholars are studying computer science; but many of them do biology, and I might do well to present the Semantic Web more from the perspective of that user community.

So I used TimBL's Bio-IT slides. They weren't shy when I went too fast with terms like hypertext, and there were a lot of furrowed brows for a while. But when I got to the "FOAF, OMM, UMLS, SNP, Uniprot, BioPAX, Patents all have some overlap with drug target ontology" drug discovery diagram, I said I didn't even know some of these words and asked them which ones they knew. After a chuckle about "drug", one of them explained about SNP, i.e. single nucleotide polymorphism, and another told me about OMM and the discussion really got going. I didn't make much more use of Tim's slides. One great question about integrating data about one place from lots of sources prompted me to tempt the demo gods and try the tabulator. The demo gods were not entirely kind; perhaps I should have used the released version rather than the development version. But I think I did give them a feel for it. In answer to "so what is it you're trying to do, exactly?" I gave a two part answer:

  1. Recruit some of them to work on the tabulator so that their name might be on the next paper like the SWUI06 paper, Tabulator: Exploring and Analyzing linked data on the Semantic Web.
  2. Integrate data across applications and across administrative boundaries all over the world, like the Web has done for documents.

We touched on the question of local and global consistency, and someone asked if you can reason about disagreement. I said that yes, I had presented a paper in Edinburgh just this May that demonstrated formally a disagreement between several parties.

One of the last questions was "So what is computer science research anyway?" which I answered by appeal to the DIG mission statement:

The Decentralized Information Group explores technical, institutional and public policy questions necessary to advance the development of global, decentralized information environments.

And I said how cool it is to have somebody in the TAMI project with real-world experience with the privacy act. One student followed up and asked if we have anybody with real legal background in the group, and I pointed him to Danny. He asked me afterward how to get involved, and it turned out that IRC and freenode are known to him, so the #swig channel was in our common neighborhood in cyberspace, even if geography would separate us as I headed to the airport to fly home.


converting vcard .vcf syntax to hcard and catching up on CALSIFY

Submitted by connolly on Thu, 2006-06-29 00:17.

A while back I wrote about using JSON and templates to produce microformat data. I swapped some of those ideas in today while trying to figure out a simple, consistent model for recurring events using floating times plus locations.

I spent a little time catching up on the IETF CALSIFY WG; they meet Wednesday, July 12 at 9am in Montreal. I wonder how much progress they'll make on issues like the March 2007 DST change and the CalConnect recommendations toward an IANA timezone registry.

When I realized I didn't have a clear problem or use case in mind, I went looking for something that I could chew on in test-driven style.

So I picked up the hcard tests and built a vcard-to-hcard converter sort of out of spare parts. icslex.py handles low-level lexical details of iCalendar, which turn out to have lots in common with vCard: line breaking, escaping, that sort of thing. On top of that, I wrote vcardin.py, which has enough vcard schema smarts to break down the structured N and ADR and GEO properties so there's no microparsing below the JSON level. Then contacts.kid is a kid template that spits out the data following the hcard spec.
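
The lexical layer and the structured-N breakdown described above can be sketched as follows. This is not icslex.py or vcardin.py; the function names are made up, only N and FN are handled, and parameters, group prefixes, and quoted parameter values are deliberately ignored.

```python
import re

# hCard sub-property names, in vCard N order
N_FIELDS = ("family-name", "given-name", "additional-name",
            "honorific-prefix", "honorific-suffix")


def unfold(text):
    """Undo line folding: a line break plus one space or tab continues a line."""
    return re.sub(r"\r?\n[ \t]", "", text)


def split_raw(value, sep):
    """Split on sep, leaving backslash escapes intact for unescape()."""
    parts, cur, i = [], "", 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value):
            cur += value[i:i + 2]
            i += 2
        elif value[i] == sep:
            parts.append(cur)
            cur = ""
            i += 1
        else:
            cur += value[i]
            i += 1
    parts.append(cur)
    return parts


def unescape(s):
    r"""Turn \n or \N into a newline and drop the backslash before anything else."""
    return re.sub(r"\\(.)",
                  lambda m: "\n" if m.group(1) in "nN" else m.group(1), s)


def parse_vcard(text):
    """Return a JSON-shaped dict with 'fn' and a broken-down 'n'."""
    card = {}
    for line in unfold(text).splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        name = name.split(";")[0].upper()   # drop parameters such as TYPE=work
        if name == "N":
            fields = [unescape(p) for p in split_raw(value, ";")]
            card["n"] = dict(zip(N_FIELDS, fields))
        elif name == "FN":
            card["fn"] = unescape(value)
    return card
```

Splitting before unescaping is the point of the two-step design: an escaped semicolon inside a name component must not be mistaken for a component separator.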

It works like this:

python vcardin.py contacts.kid hcard/01-tantek-basic.vcf >,01.html

Then I used X2V to convert the results back to .vcf format, compared them using hcard testing tools (normalize.pl and such), and fixed the breakage. Lather, rinse, repeat... I have pretty much all the tests working except 16-honorific-additional-multiple.

It really is a pain to set up a loop for the additional-name field when that field is almost never used, let alone used with multiple values. This sort of structure is more natural in XML/XHTML/hCard than in JSON, in a way. And if I change the JSON structure from a string to a list, does that mean the RDF property should use a list/collection? Probably not... I probably don't need additional-name to be an owl:FunctionalProperty.

Hmm... meanwhile, this contacts.kid template should mesh nicely with my wikipedia airport scraping hack...

See also: IRC notes from #microformats, from #swig.

RDF, Microformats, and Javascript hacking in person at the 'tute

Submitted by connolly on Thu, 2006-05-04 16:14.

My regular schedule of working group meetings and conferences had a gap in April, and my list of reasons to chat with Ben was growing, and we're recruiting some UROPs to work on the tabulator project this summer, so I flew up for a visit to MIT.

I didn't have any particular appointments the first day, so I used the few spare minutes on the T between the airport and MIT to scare up contacts using my handheld gizmo. It turned out Aaron was in town and available for lunch in Harvard Square. We talked about life in start-ups, standards orgs, and research. He suggested layout stuff from Java and Apple should make its way into CSS and offered to write up a few details.

I spent much of Thursday with Ben working on javascript hacks to explore calendar data in RDFa. We did some whiteboard noodling about RDFa and microformats. He showed me the JavaScript shell, which is pretty cool... it gives a read-eval-print loop and tab completion in firefox... just like a lisp machine ;-) Elias dropped by and mixed in some javascript hacking he's been doing to connect SPARQL with the google calendar. Ben and I didn't get around to converting my itinerary to RDFa like we planned, but we got pretty close; he sent out a New RDFa demo the next day. That same day, he came down to meet with me again, but I had to go work on a DARPA report, so we were trying to figure out next steps, and he came up with a cool idea and sent it out: Proposal: hGRDDL, an extraction from Microformats to RDFa. Elias and I did some whiteboard noodling too, in the neighborhood of JSON and templates and microformats.

Elias is learning more than he ever wanted to about calendars and timezones. It's like Dougal Campbell said to microformats-discuss:

My server is in timezone A, but I live in timezone B, but I'm posting information about an event that will occur in timezone C. Shoot me now.

On this trip, a Samsonite Dimension Notebook Case from SAM's replaced my aging W3C bag in my travel kit. I think I like all the little pockets and such, but I'm not sure; sometimes I miss the simplicity of one big compartment. I'm sure that I'm not happy that my Kensington K33069 Universal AC/Car/Air Adapter stopped working somewhere between MIT and MKE.

That's one of the reasons that I always pack some light reading on dead-trees. I enjoyed escaping into Scott Lasser's Battle Creek on the way home. Baseball and fathers. Good stuff.

citing W3C specs from WWW conference papers

Submitted by connolly on Tue, 2006-04-25 10:19.

As I said in a July 2000 message to www-rdf-interest:

There are very few data formats I trust... when I use the computer to capture my knowledge, I pretty much stick to plain text, XML (esp XHTML, or at least HTML that tidy can turn into XHTML for me), RCS/CVS, and RFC822/MIME. I use JPG, PNG, and PDF if I must, but not for capturing knowledge for exchange, revision, etc.

And as I explained in a 1994 essay, converting from LaTeX is hard, so I try not to write in LaTeX either.

The Web conference has instructions for submitting PDF using LaTeX or MS Word and (finally!) for submitting XHTML. (The WWW2006 paper CSS stylesheet is horrible... who wants to read 9pt times on screen?!?! Anyway...) So when the IRW 2006 organizers told me they'd like a PDF version of my paper in that style, I dusted off my Transforming XHTML to LaTeX and BibTeX tools and got to work.

My paper cites a number of W3C specs, including HTML 4. The W3C tech reports index/digital library has an associated bibliography generator. I fed it http://www.w3.org/TR/html401 and it generated a nice bibliographic reference from an RDF data set. I'm interested in the ongoing citation microformats work that might make that transformation lossless, since I need not just XHTML, but BibTeX. What I'm doing currently is adding some bibtex vocabulary in class and rel attributes:

<dt class="TechReport">
<a name="HTML4" id="HTML4">[HTML4]</a>
</dt>

<dd><span class="author">Le Hors, Arnaud and Raggett, Dave and
Jacobs, Ian</span> Editors,
<cite> <a
href="http://www.w3.org/TR/1999/REC-html401-19991224">HTML 4.01
Specification</a> </cite>,
<span class="institution">W3C</span> Recommendation,
24 <span class="month">December</span> <span class="year">1999</span>,
<tt class="number">http://www.w3.org/TR/1999/REC-html401-19991224</tt>.
<a href="http://www.w3.org/TR/html401" title="Latest version of
HTML 4.01 Specification">Latest version</a> available at
http://www.w3.org/TR/html401 .</dd>

When run thru my xh2bib.xsl, out comes:

@TechReport{HTML4,
title = "{
HTML 4.01 Specification
}",
    author = {Le Hors, Arnaud
and Raggett, Dave
and Jacobs, Ian},
    institution = {W3C},
    month = {December},
    year = {1999},
    number = {http://www.w3.org/TR/1999/REC-html401-19991224},
    howpublished = { \url{http://www.w3.org/TR/1999/REC-html401-19991224} }
}

I think I should be using editor = rather than author = but that didn't work the 1st time I tried and I haven't investigated further.

In any case, I'm reasonably happy with the PDF output.
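
xh2bib.xsl itself is not reproduced here. The same class-attribute extraction can be sketched in Python with the standard ElementTree parser; this is an illustration under assumptions, not a port of the stylesheet. It expects well-formed, un-namespaced markup (real XHTML with an xmlns declaration would need namespace-qualified tag names), handles only the fields listed in FIELDS, and omits howpublished.

```python
import xml.etree.ElementTree as ET

FIELDS = ("author", "institution", "month", "year", "number")


def text_of(el):
    """All text under el, with runs of whitespace collapsed."""
    return " ".join("".join(el.itertext()).split())


def dl_to_bibtex(xhtml):
    """Turn dt/dd pairs marked up with bibtex class names into BibTeX entries."""
    root = ET.fromstring(xhtml)
    entries, dt = [], None
    for el in root.iter():
        if el.tag == "dt":
            dt = el                       # remember the term; its dd follows
        elif el.tag == "dd" and dt is not None:
            key = dt.find("a").get("id")  # citation key from the anchor id
            # extra braces keep BibTeX from changing the title's case
            fields = [("title", "{%s}" % text_of(el.find("cite")))]
            for child in el.iter():
                if child.get("class") in FIELDS:
                    fields.append((child.get("class"), text_of(child)))
            body = ",\n".join("    %s = {%s}" % f for f in fields)
            entries.append("@%s{%s,\n%s\n}" % (dt.get("class"), key, body))
            dt = None
    return "\n\n".join(entries)
```

Collapsing whitespace in text_of avoids the stray line breaks inside the title and author values that show up in the output above.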
