student project

DIG losing the battle with spammers again

Submitted by connolly on Tue, 2009-03-10 11:56.

Blog spam went out of control again; the only remedy I could find was a very big hammer: turn off the drupal comments module altogether and in doing so, unpublish all comments ever posted to this site. I suppose they're still in the database and could be published again, if we could separate them from the spam.

The drupal expertise in our group seems to have gone on to greener pastures. That prompted me to divest from my family business's drupal installation and start a hosted wordpress site, and it makes me wonder how safe the stuff I write here is...

Any MIT students want to help this research group manage a community presence? Please get in touch.

Talking with U.T. Austin students about the Microformats, Drug Discovery, the Tabulator, and the Semantic Web

Submitted by connolly on Sat, 2006-09-16 21:36.

Working with the MIT tabulator students has been such a blast that while I was at U.T. Austin for the research library symposium, I thought I would try to recruit some undergrads there to get into it. Bob Boyer invited me to speak to his PHL313K class on why the heck they should learn logic, and Alan Cline invited me to the Dean's Scholars lunch, which I used to attend when I was at U.T.

To motivate logic in the PHL313K class, I started with their experience with HTML and blogging and explained how the Semantic Web extends the web by looking at links as logical propositions. I used my XML 2005 slides to talk a little bit about web history and web architecture, and then I moved into using hCalendar (and GRDDL, though I left that largely implicit) to address the personal information disaster. This was the first week or so of class and they had just started learning propositional logic, and hadn't even gotten as far as predicate calculus where atomic formulas like those in RDF show up. And none of them had heard of microformats. I promised not to talk for the full hour but then lost track of time and didn't get to the punch line, "so the computer tells you that no, you can't go to both the conference and Mom's birthday party because you can't be in two places at once" until it was time for them to head off to their next class.

One student did stay after to pose a question that is very interesting and important, if only tangentially related to the Semantic Web: with technology advancing so fast, how do you maintain balance in life?

While Boyer said that talk went well, I think I didn't do a very good job of connecting with them; or maybe they just weren't really awake; it was an 8am class after all. At the Dean's Scholars lunch, on the other hand, the students were talking to each other so loudly as they grabbed their sandwiches that Cline had to really work to get the floor to introduce me as a "local boy done good." They responded with a rousing ovation.

Elaine Rich had provided the vital clue for connecting with this audience earlier in the week. She does AI research and had seen TimBL's AAAI talk. While she didn't exactly give the talk four stars overall, she did get enough out of it to realize it would make an interesting application to add to a book that she's writing, where she's trying to give practical examples that motivate automata theory. So after I took a look at what she had written about URIs and RDF and OWL and such, she reminded me that not all the Dean's Scholars are studying computer science; many of them do biology, and I might do well to present the Semantic Web more from the perspective of that user community.

So I used TimBL's Bio-IT slides. They weren't shy when I went too fast with terms like hypertext, and there were a lot of furrowed brows for a while. But when I got to the drug discovery diagram ("FOAF, OMM, UMLS, SNP, Uniprot, Bipax, Patents all have some overlap with drug target ontology"), I said I didn't even know some of these words and asked them which ones they knew. After a chuckle about "drug", one of them explained about SNP, i.e. single nucleotide polymorphism, and another told me about OMM, and the discussion really got going. I didn't make much more use of Tim's slides. One great question about integrating data about one place from lots of sources prompted me to tempt the demo gods and try the tabulator. The demo gods were not entirely kind; perhaps I should have used the released version rather than the development version. But I think I did give them a feel for it. In answer to "so what is it you're trying to do, exactly?" I gave a two part answer:

  1. Recruit some of them to work on the tabulator so that their name might be on the next paper like the SWUI06 paper, Tabulator: Exploring and Analyzing linked data on the Semantic Web.
  2. Integrate data across applications and across administrative boundaries all over the world, like the Web has done for documents.

We touched on the question of local and global consistency, and someone asked if you can reason about disagreement. I said that yes, I had presented a paper in Edinburgh just this May that formally demonstrated a disagreement between several parties.

One of the last questions was "So what is computer science research anyway?" which I answered by appeal to the DIG mission statement:

The Decentralized Information Group explores technical, institutional and public policy questions necessary to advance the development of global, decentralized information environments.

And I said how cool it is to have somebody in the TAMI project with real-world experience with the privacy act. One student followed up and asked if we have anybody with real legal background in the group, and I pointed him to Danny. He asked me afterward how to get involved, and it turned out that IRC and freenode are known to him, so the #swig channel was in our common neighborhood in cyberspace, even though geography would separate us as I headed to the airport to fly home.

Blogged with Flock

An Introduction and a JavaScript RDF/XML Parser

Submitted by dsheets on Mon, 2006-07-17 15:02.

My name is David Sheets. I will be a sophomore at MIT this fall. I like to be at the intersection of theory and practice.

This summer, I am working as a student developer on the Tabulator Project in the Decentralized Information Group at MIT's CSAIL. My charge has been to develop a new RDF/XML parser in JavaScript with a view to a JavaScript RDF library. I am pleased to report that I have finished the first version of the new RDF/XML parser.

Before this release, the only available RDF/XML parser in JavaScript was Jim Ley's parser.js. This parser served the community well for quite a while but fell short of the needs of the Tabulator Project. Most notably, it didn't parse all valid RDF/XML resources.

To rectify this, work on a new parser was begun. The result that is being released today is a JavaScript class that weighs in at under 400 source lines of code and 2.8K gzip compressed (12K uncompressed). For maximum utility, a parser should be small, standards-compliant, widely portable, and fast.

To the best of my knowledge, RDFParser is fully compliant with the RDF/XML specification. The parser passes all of the positive parser test cases from the W3C. This was tested using jsUnit -- a unit testing framework similar to jUnit but for JavaScript. To run the automated tests against RDFParser, you can follow the steps here. Passing these tests means the parser supports features such as xml:base, xml:lang, RDF Collections, XML literals, and so forth. If it's in the specification, it should be supported. An important point to note is that this parser, due to speed concerns, is non-validating. Additionally, RDFParser has been speed optimized, resulting in code that is slightly less readable.

The new parser is not as portable as the old parser at this time. It has only been tested in Firefox 1.5 but should work in any browser that supports the DOM Level 2 specification.

RDFParser runs at a speed similar to Jim Ley's parser. One can easily construct example RDF/XML files that run faster on one parser or another. I took five files that the tabulator might come across in day-to-day use and I ran head-to-head benchmarks between the two parsers.

Parse time is highly influenced by compact serialization. The more nested the RDF/XML serialization, the more scope frames must be created to track features from the specification. The less nested, the fewer steps to traverse the DOM, the more triples per DOM element.

Planned in the next release of RDFParser is a callback/continuation system so that the parser can yield in the middle of a parse run and allow other important page features to run.

API documentation for RDFParser included in the Tabulator 0.7 release is available.

Finally, I'd be happy to hear from you if you have questions, comments, or ideas regarding the RDFParser or related technologies.

Exporting databases in the Semantic Web with SPARQL, D2R, dbview, ARC, and such

Submitted by connolly on Fri, 2006-06-02 16:55.

The developer track at WWW2006 last week in Edinburgh was really cool; you had to show up on time or you couldn't fit in the room! One of the coolest talks was D2R-Server - Publishing Relational Databases on the Web as SPARQL-Endpoints. I see D2R Server is released now. Cool.

Yes, storing RDF in a SQL database using 3-column tables (or 4 or 5 or 6...) is cool as far as it goes, but I'm glad we're finally seeing more work on taking existing SQL databases (whose schemas are not designed with RDF in mind) and exporting them as RDF.
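For anyone who hasn't seen the 3-column approach, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names, and the example.org people are made up for illustration; only the FOAF namespace is real. Note how even a simple "names of the people Alice knows" question turns into a self-join on the one big table.

```python
import sqlite3

# Hypothetical schema: one row per RDF statement (subject, predicate, object).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")

FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/people#alice"  # made-up URIs
BOB = "http://example.org/people#bob"

conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (ALICE, FOAF + "name", "Alice"),
        (ALICE, FOAF + "knows", BOB),
        (BOB, FOAF + "name", "Bob"),
    ],
)


def names_known_by(person):
    """Self-join the triples table: person --knows--> x --name--> n."""
    query = """
        SELECT n.o FROM triples AS k
        JOIN triples AS n ON n.s = k.o
        WHERE k.s = ? AND k.p = ? AND n.p = ?
    """
    args = (person, FOAF + "knows", FOAF + "name")
    return [row[0] for row in conn.execute(query, args)]


print(names_known_by(ALICE))
```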

TimBL wrote a design note on Relational Databases on the Semantic Web in 1998. In 2002, I wrote dbview, a couple hundred lines of python that implements parts of it. Rob Crowell picked it up and the 2005/2006 version of dbview now does foreign keys and backlinks.

D2R gets points for using RDF for their configuration/mapping info. The slides showed turtle/n3. Why are the dbin brainlets in XML but not RDF? I wonder.

D2R Server has a mapping layer; dbview assumes that will be handled with rules. The choice of URIs for column names is interesting. D2R uses jdbc:mysql://, but dbview is all about embedding a SQL database in HTTP space, so we use URIs like http://db.example/orders/customers/custno/1#item. In dbview, the decisions about when to use / and when to use # are made so that the result is browseable. In D2R, the default URIs don't matter as much because it's expected that they'll be mapped to a more well-known ontology/schema like foaf.
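To make the URI pattern concrete, here is a rough sketch of exporting rows as triples, with row URIs shaped like the http://db.example/orders/customers/custno/1#item example. This is not the dbview code: the function names, the sample table, and especially the column URI scheme in `column_uri` are assumptions made up for illustration.

```python
import sqlite3
from urllib.parse import quote

BASE = "http://db.example/orders"  # the "orders" database, mounted in HTTP space


def row_uri(table, key_col, key_val):
    """URI for the thing a row describes, following the pattern in the text."""
    return f"{BASE}/{quote(table)}/{quote(key_col)}/{quote(str(key_val))}#item"


def column_uri(table, col):
    """Hypothetical property URI for a column; dbview's real choice may differ."""
    return f"{BASE}/{quote(table)}#{quote(col)}"


def export_table(conn, table, key_col):
    """Yield (subject, predicate, object) triples, one per non-key cell."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    cols = [d[0] for d in cur.description]
    for row in cur:
        record = dict(zip(cols, row))
        subject = row_uri(table, key_col, record[key_col])
        for col, value in record.items():
            if col != key_col and value is not None:
                yield (subject, column_uri(table, col), value)


# Made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (custno INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme')")

for triple in export_table(conn, "customers", "custno"):
    print(triple)
```

The part before the # is what a browser would actually fetch, so each row gets its own retrievable document, and the #item fragment names the thing the row is about rather than the document.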

dbview is still just a few hundred lines of python; we haven't integrated the SPARQL parser that Yosi developed for cwm, nor integrated EricP's work on federated query.

Speaking of federated query... on Wednesday at the conference, I saw Tim Finin in the poster session. He showed me something the swoogle folks are cooking up: you give it a SPARQL query, and it looks at the terms used in your query and suggests documents you should put in your SPARQL dataset to run your query against. I hope to hear more about that.

Somewhere in EricP's work is one of the several SPARQL-to-SQL rewriters out there... oh... I thought the HP tech report, A relational algebra for SPARQL was another one, but it seems to be by Richard Cyganiak, one of the D2R guys.

Benjamin Nowack's Feb 2006 item announced a SPARQL-to-SQL rewriter for his ARC RDF store for PHP.

Hmm... maybe it's time for a ScheduledTopicChat on SPARQL, SQL, and RDF? If you're interested, suggest a couple times that would be good for you in a comment or in mail to me and a public archive.

webizing TaskJuggler

Submitted by connolly on Fri, 2006-05-19 11:29.

Going over my bookmarks I rediscovered TaskJuggler:

TaskJuggler provides an optimizing scheduler that computes your project time lines and resource assignments based on the project outline and the constrains that you have provided. The build-in resource balancer and consistency checker offload you from having to worry about irrelevant details and ring the alarm if the project gets out of hand.

Sounds like this tool might be applicable to the hard problem of scheduling meetings with various constraints.

It seems to have a declarative project description language:

flags team

resource dev "Developers" {
  resource dev1 "Paul Smith" { rate 330.0 }
  resource dev2 "Sebastien Bono"
  resource dev3 "Klaus Mueller" { vacation 2002-02-01 - 2002-02-05 }

  flags team
}
resource misc "The Others" {
  resource test "Peter Murphy" { limits { dailymax 6.4h } rate 240.0 }
  resource doc "Dim Sung" { rate 280.0 }

  flags team
}

What might that look like in N3, i.e. in RDF that the tabulator could browse around? (See the N3 primer to get a feel for RDF and N3.) What would it take to webize TaskJuggler?
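As one speculative starting point, here is a small Python sketch that prints N3 for a few of the resources in the example above. The tj: vocabulary and both namespace URIs are invented for illustration; I don't know of any real TaskJuggler ontology, and the resource data is retyped by hand rather than parsed from a project file.

```python
# Resources from the TaskJuggler example, retyped by hand as plain data.
resources = [
    {"id": "dev1", "name": "Paul Smith", "group": "dev", "rate": 330.0},
    {"id": "dev2", "name": "Sebastien Bono", "group": "dev"},
    {"id": "test", "name": "Peter Murphy", "group": "misc", "rate": 240.0},
]

# Hypothetical namespaces; the tj: terms below are not a published vocabulary.
PREFIXES = (
    "@prefix tj: <http://example.org/taskjuggler#> .\n"
    "@prefix : <http://example.org/project#> .\n"
)


def to_n3(resource):
    """Render one resource as a group of N3 statements about one subject."""
    parts = [
        "a tj:Resource",
        f'tj:name "{resource["name"]}"',
        f'tj:memberOf :{resource["group"]}',
    ]
    if "rate" in resource:
        parts.append(f'tj:rate {resource["rate"]}')
    return f':{resource["id"]} ' + ";\n    ".join(parts) + " ."


print(PREFIXES)
print("\n".join(to_n3(r) for r in resources))
```

A real webizing effort would also have to decide how nested resource groups, vacations, and limits map to properties, which this sketch sidesteps.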

Also... how does the taskjuggler consistency checker relate to OWL consistency checkers like pellet?

tabulator use cases: when can we meet? and PathCross

Submitted by connolly on Wed, 2006-02-08 13:48.

I keep all sorts of calendar info in the web, as do my colleagues and the groups and organizations we participate in.

Suppose it was all in RDF, either directly as RDF/XML or indirectly via GRDDL as hCalendar or the like.

Wouldn't it be cool to grab a bunch of sources, and then tabulate names vs. availability on various dates?
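A toy sketch of that tabulation in Python, assuming the calendar data has already been pulled out of the RDF into (person, first busy day, last busy day) tuples. The people, dates, and function names are made up; getting from hCalendar to these tuples is the part that GRDDL would handle.

```python
from datetime import date, timedelta

# Made-up busy periods, as if already extracted from each person's calendar.
busy = [
    ("Alice", date(2006, 2, 13), date(2006, 2, 14)),
    ("Bob", date(2006, 2, 14), date(2006, 2, 15)),
]


def tabulate(busy, first, days):
    """Map each person to a list of booleans: True where they are free."""
    dates = [first + timedelta(days=i) for i in range(days)]
    table = {}
    for person, start, end in busy:
        row = table.setdefault(person, [True] * days)
        for i, d in enumerate(dates):
            if start <= d <= end:
                row[i] = False
    return dates, table


def common_free(dates, table):
    """Dates on which every person in the table is free."""
    return [d for i, d in enumerate(dates) if all(row[i] for row in table.values())]


dates, table = tabulate(busy, date(2006, 2, 13), 5)
for person, row in table.items():
    # One row per name, one column per date: "." is free, "X" is busy.
    print(person, "".join("." if free else "X" for free in row))
print("everyone free:", [d.isoformat() for d in common_free(dates, table)])
```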

I would probably need rules in the tabulator; Jos's work sure seems promising.

Closely related is the PathCross use case...

Suppose I'm travelling to Boston and San Francisco in the next couple months. I'd like my machine to let me know I have a FriendOfaFriend who also lives there or plans to be there.
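The core check is just matching cities and intersecting date ranges. Here is a minimal Python sketch with made-up trips and friends; someone who lives in a city is modelled as being there for all time, which is an assumption of this sketch rather than anything the tabulator does.

```python
from datetime import date

# Made-up data: my trips, and where friends-of-friends live or plan to be.
my_trips = [
    ("Boston", date(2006, 3, 1), date(2006, 3, 5)),
    ("San Francisco", date(2006, 4, 10), date(2006, 4, 12)),
]
friend_plans = [
    ("Alice", "Boston", date(2006, 3, 4), date(2006, 3, 8)),      # visiting
    ("Bob", "San Francisco", date.min, date.max),                 # lives there
    ("Carol", "San Francisco", date(2006, 5, 1), date(2006, 5, 2)),
]


def path_crossings(trips, plans):
    """Return (who, city, from, to) for each overlap in both place and time."""
    crossings = []
    for city, start, end in trips:
        for who, where, f_start, f_end in plans:
            # Two closed date ranges overlap when each starts before the other ends.
            if where == city and start <= f_end and f_start <= end:
                crossings.append((who, city, max(start, f_start), min(end, f_end)))
    return crossings


for who, city, first, last in path_crossings(my_trips, friend_plans):
    print(f"{who} is in {city} from {first} to {last}")
```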

See also the Open Group's Federated Free/Busy Challenge.

Drupal, OpenID, and the Mac OS X Keychain

Submitted by connolly on Mon, 2005-12-19 16:12.

Managing passwords via email callback is hampered by anti-spam mechanisms. I just helped a breadcrumbs user whose password message from drupal was classified as Junk by Mac OS X Mail.

Meanwhile, I did enough research on the Mac OS X keychain to trust it. Support for OpenID in drupal is already on the OpenID wish list and I've seen some progress.

It's not obvious to me how to connect the keychain to OpenID, but I'm sure there's a way. Any suggestions?

Connecting DIG Student Projects to the MIT UROP listing

Submitted by connolly on Mon, 2005-12-19 00:51.

A couple MIT students have found their way to the #dig channel and asked about UROPs during IAP. I'm still learning about student rhythms at MIT; I was never a student here; I got my degree at U.T. Austin. My ten years with W3C have exposed me to the terms UROP and IAP before, but I have paged most of it out. Let's refresh our cache, shall we?

The Independent Activities Period (IAP) is a special four week term at MIT that runs from the first week of January until the end of the month. IAP 2006 takes place from January 9 through February 3.

IAP overview

In UROP info for supervisors, I see there's a form for listing projects. Hey... it would be cool if the student projects category here in this blog were automatically syndicated via that form. A meta-student-project?

Meanwhile, we do have a few notes on student projects among our DIG info for MIT students.

I'm not sure how items syndicated from Danny/Eric via the WordPress plug-in can get categorized; I suppose we can do it manually, after-the-fact?

I see a bunch of UROP openings for this time of year. The Building Games to Acquire Commonsense Knowledge project looks cool.

NOTE: It is expected that UROP students are supervised in the laboratory at all times, per the Institute's "no working alone" policy.

UROP safety issues

Sounds a bit like a "no coding alone" policy that I've been pushing around W3C and DIG, since discovering the value of pair programming, or a variant of it.

Toward richtext syndicated feed

Submitted by connolly on Wed, 2005-12-14 12:25.

Our RSS feed is plaintext, so when it's syndicated in Planet RDF and the like, there are no links or pictures or even paragraph breaks.

From #swig discussion, I gather that the state-of-the-art is to use nasty escaped markup, but I'm not up for that. The RDF Core WG didn't spend 18 months getting the details of parseType="Literal" right for nothing, did we?

I don't know if there are drupal modules available that Do The Right Thing, and due to my PHP angst I don't really want to know. But maybe there's a motivated student out there... ?

GRDDL transform wanted: National Information Exchange Model (NIEM)

Submitted by connolly on Thu, 2005-12-01 11:50.

Via Karen in the TAMI project, I gather that last month the Department of Justice and the Department of Homeland Security announced the result of an XML collaboration: version 0.1 of the National Information Exchange Model (NIEM), which will be used by the law enforcement, emergency management, and related communities and the parties who exchange information with them.

I hope to check it out. Better yet... I hope somebody else checks it out and writes a GRDDL transformation.
