Map and Territory in RDF APIs

Submitted by connolly on Tue, 2010-04-27 14:30.

RDF specs and APIs have made a bit of a mess out of a couple of pretty basic tools of math and computing: graphs and logic formulas. With the RDF next steps workshop coming up, Pat Hayes re-thinking RDF semantics, and Sandro thinking out loud about RDF2, I'd like us to think about RDF in more traditional terms. The scala programming language turns out to be an interesting framework for exploring how these tools relate to RDF.

The Feb 1999 RDF spec wasn't very clear about the map and the territory. It said that statements are made out of parts in the territory, rather than features on the map, which doesn't make very much sense. RDF APIs seem to inherit this confusion; e.g. from an RDF::Value class for ruby:

Examples:

Checking if a value is a resource (blank node or URI reference)

value.resource

Blank nodes and URI references are parts of the map; resources are in the territory.

Likewise in Package org.jrdf.graph:

Resource: A resource stands for either a Blank Node or a URI Reference.

The 2004 RDF specs take great pains to clarify these use/mention distinctions, but they also go on at great length.

Let's review Wikipedia on graphs:

In mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. ...  The edges may be directed (asymmetric) or undirected (symmetric) ... and the edges are called directed edges or arcs; ... graphs which have labeled edges are called edge-labeled graphs.


With that in mind, in the swap-scala project, we summarize the RDF abstract syntax as an edge-labelled directed graph with just one or two wrinkles:

package org.w3.swap.rdf

trait RDFGraphParts {
  type Arc = (SubjectNode, Label, Node)

  type Node
  type Literal <: Node
  type SubjectNode <: Node
  type BlankNode <: SubjectNode
  type Label <: SubjectNode
}

The wrinkles are:

  • Arcs can only start from BlankNodes or Labels, i.e. SubjectNodes
  • Arc labels may also appear as Nodes

We use another trait to relate concrete datatypes to these abstract types:

trait RDFNodeBuilder extends RDFGraphParts {
  def uri(i: String): Label

  type LanguageTag = Symbol
  def plain(s: String, lang: Option[LanguageTag]): Literal
  def typed(s: String, dt: String): Literal
  def xmllit(content: scala.xml.NodeSeq): Literal
}

This doesn't pin down what a Label is, but in any concrete implementation, you can build one from a String using the uri method. The RDFNodeBuilder trait is used to implement RDF/XML, RDFa, and turtle parsers that are agnostic to the concrete implementation of an RDF graph.
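
To make that concrete, here is a minimal sketch (my illustration, not from swap-scala) of an implementation that represents every kind of node as a plain String:

object SimpleBuilder extends RDFNodeBuilder {
  // Collapse the whole node hierarchy to String; this satisfies
  // the subtype bounds in RDFGraphParts trivially.
  type Node = String
  type Literal = String
  type SubjectNode = String
  type BlankNode = String
  type Label = String

  def uri(i: String): Label = i
  def plain(s: String, lang: Option[LanguageTag]): Literal =
    lang match {
      case Some(tag) => "\"" + s + "\"@" + tag.name
      case None      => "\"" + s + "\""
    }
  def typed(s: String, dt: String): Literal = "\"" + s + "\"^^<" + dt + ">"
  def xmllit(content: scala.xml.NodeSeq): Literal = content.toString
}

Any of the parsers can then produce arcs whose parts happen to be Strings, without knowing that's what they are.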

Now let's look at terms of first order logic:

 The set of terms is inductively defined by the following rules:

  1. Variables. Any variable is a term.
  2. Functions. Any expression f(t1,...,tn) of n arguments (where each argument ti is a term and f is a function symbol of valence n) is a term.

This is represented straightforwardly in scala:

package org.w3.swap.logic1

/**
 * A Term is either a Variable or a FunctionTerm.
 */
sealed abstract class Term { ... }

class Variable extends Term { ... }

abstract class FunctionTerm() extends Term {
  def fun: Any
  def args: List[Term]
}
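
A concrete function symbol can then be introduced by subclassing FunctionTerm; for instance (a hypothetical sketch of mine, not from the codebase, assuming Variable is directly instantiable):

// An applied function symbol f(t1, ..., tn).
case class Apply(fun: Symbol, args: List[Term]) extends FunctionTerm

// Building the term f(x, g(y)):
val x = new Variable
val y = new Variable
val fxgy: Term = Apply('f, List(x, Apply('g, List(y))))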

Core RDF doesn't cover all of first order logic; it corresponds fairly closely to the conjunctive query fragment:

The conjunctive queries are simply the fragment of first-order logic given by the set of formulae that can be constructed from atomic formulae using conjunction (∧) and existential quantification (∃), but not using disjunction (∨), negation (¬), or universal quantification (∀).

We can then excerpt just the relevant parts of the definition of formulas:

The set of formulas is inductively defined by the following rules:

  1. Predicate symbols. If P is an n-ary predicate symbol and t1, ..., tn are terms then P(t1,...,tn) is a formula.
  2. Binary connectives. If φ and ψ are formulas, then (φ → ψ) is a formula. Similar rules apply to other binary logical connectives.
  3. Quantifiers. If φ is a formula and x is a variable, then ∀x φ and ∃x φ are formulas.

Our scala representation follows straightforwardly:

package org.w3.swap.logic1ec

import swap.logic1.{Term, Variable}

sealed abstract class ECFormula
case class Exists(vars: Set[Variable], g: And) extends ECFormula
sealed abstract class Ground extends ECFormula
case class And(fmlas: Seq[Atomic]) extends Ground
case class Atomic(rel: Symbol, args: List[Term]) extends Ground

Now that we have scala representations for RDF graphs and conjunctive query formulas, how do we relate them? This is the fun part:

package org.w3.swap.rdflogic

import swap.rdf.RDFNodeBuilder
import swap.logic1.{Term, FunctionTerm, Variable}
import swap.logic1ec.{Exists, And, Atomic, ECProver, ECFormula}

/**
 * RDF has only ground, 0-ary function terms.
 */
abstract class Ground extends FunctionTerm {
  override def fun = this
  override def args = Nil
}

case class Name(n: String) extends Ground
case class Plain(s: String, lang: Option[Symbol]) extends Ground
case class Data(lex: String, dt: Name) extends Ground
case class XMLLit(content: scala.xml.NodeSeq) extends Ground


/**
 * Implement RDF Nodes (except BlankNode) using FOL function terms.
 */
trait TermNode extends RDFNodeBuilder {
  type Node = Term
  type SubjectNode = Term
  type Label = Name

  def uri(i: String) = Name(i)

  type Literal = Term
  def plain(s: String, lang: Option[Symbol]) = Plain(s, lang)
  def typed(s: String, dt: String): Literal = Data(s, Name(dt))
  def xmllit(e: scala.xml.NodeSeq): Literal = XMLLit(e)
}
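
For example (a usage sketch of mine, not from the swap-scala codebase), the one abstract type left open, BlankNode, can be filled in with logic variables:

import swap.logic1.Variable

// Blank nodes become logic variables; the builder then yields logic terms.
object TN extends TermNode { type BlankNode = Variable }

val alice = TN.uri("http://example/#alice") // Name("http://example/#alice")
val bob = TN.plain("Bob", None)             // Plain("Bob", None)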

The abstract RDFNodeBuilder node types are implemented as first order logic terms. For formulas, we use a "holds" predicate:

object RDFLogic extends ... {
  def atom(s: Term, p: Term, o: Term): Atomic = {
    Atomic('holds, List(s, p, o))
  }
  def atom(arc: (Term, Term, Term)): Atomic = {
    Atomic('holds, List(arc._1, arc._2, arc._3))
  }
}

Then all the semantic machinery up to simple entailment between RDF graphs just falls out of conjunctive query.
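
For instance, here is a sketch (my illustration; the actual swap-scala API may differ) of reading off the conjunctive formula for the one-triple graph { _:b <http://example/#name> "Bob" }:

import swap.logic1.{Term, Variable}
import swap.logic1ec.{Exists, And, ECFormula}
import swap.rdflogic.{Name, Plain, RDFLogic}

// The blank node _:b becomes an existentially quantified variable
// (assuming Variable is directly instantiable).
val b = new Variable
val arcs: List[(Term, Term, Term)] =
  List((b, Name("http://example/#name"), Plain("Bob", None)))

// The graph means: ∃b. holds(b, <http://example/#name>, "Bob")
val formula: ECFormula =
  Exists(Set(b), And(arcs.map(a => RDFLogic.atom(a))))

Simple entailment between two graphs is then just entailment between two such formulas, which is what ECProver presumably decides.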

I haven't done RDFS Entailment yet; the plan is to do basic rules first (N3Rules or RIF BLD) and then use that for RDFS, OWL2-RL, and the like.

Existentials in ACL2 and Milawa make sense; how about level breakers?

Submitted by connolly on Wed, 2010-01-20 18:20.

Since my Sep 2006 visit to the ACL2 seminar, I've been trying to get my head around existentials in ACL2. The lightbulb finally went off this week while reading Jared's Dec 2009 Milawa thesis.

3.7 Provability

Now that we have a proof checker, we can use existential quantification to decide whether a particular formula is provable. Recall from page 61 the notion of a witnessing (Skolem) function. We begin by introducing a witnessing function, logic.provable-witness, whose defining axiom is as follows.

Definition 92: logic.provable-witness
(por* (pequal* ...))

Intuitively, this axiom can be understood as: if there exists an appeal which is a valid proof of x, then (logic.provable-witness x axioms thms atbl) is such an appeal.

Ding! Now I get it.

This witnessing stuff is published in other ACL2 publications, notably:

  • Structured Theory Development for a Mechanized Logic, M. Kaufmann and J Moore, Journal of Automated Reasoning 26, no. 2 (2001), pp. 161-203.

But I can't fit those in my tiny brain.

Thanks, Jared, for explaining it at my speed!

Here's hoping I can turn this new knowledge into code that converts N3Rules to ACL2 and/or Milawa's format. N3Rules covers RDF, RDFS, and, I think, OWL2-RL and some parts of RIF. Roughly the stuff FuXi covers.

I'm somewhat hopeful that the rest of N3 is just quoting. That's the intuition that got me looking into ACL2 and Milawa again after working on some TAG stuff using N3Logic to encode ABLP logic. The last time I tried turning N3 {} terms into Lisp quote expressions was when looking at IKL as a semantic framework for N3. I didn't like the results that time; I'm not sure why I expect it to be different this time, but somehow I do...

Another question that's keeping me up at night lately: is there a way to fit level-breakers such as log:uri (or name and denotation, if not wtr from KIF) into the Milawa architecture somehow?

DIG losing the battle with spammers again

Submitted by connolly on Tue, 2009-03-10 11:56.

Blog spam went out of control again; the only remedy I could find was a very big hammer: turning off the drupal comments module altogether, which unpublished all comments ever posted to this site. I suppose they're still in the database and could be published again, if we could separate them from the spam.

The drupal expertise in our group seems to have gone on to greener pastures. That prompted me to divest from my family business drupal installation and start a hosted wordpress site, and it makes me wonder how safe the stuff I write here is...

Any MIT students want to help this research group manage a community presence? Please get in touch.

No such thing as bad publicity for Facebook

Submitted by Danny Weitzner on Tue, 2009-02-17 21:03.

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

Anecdotal evidence suggests that there’s no such thing as bad publicity (at least for Facebook). In the wake of the recent flap about Facebook’s change in its terms of service, I seem to be experiencing a spike in new friend requests on Facebook. Of course, there may be no causal relationship whatsoever but I don’t think I’ve become any nicer or more popular. :-) I have a feeling people just have Facebook on the brain.

Obama's Tech Stimulus plan - Health IT, Broadband, and smart grid

Submitted by Danny Weitzner on Mon, 2009-01-26 11:03.

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

Steve Lohr has a nice piece in the New York Times (‘Technology Gets a Piece of Stimulus,’ 26 Jan 2009, p. C1) this morning about the role that technology and innovation will play in the economic recovery (aka stimulus) bill supported by the Obama Administration.

In the past, health IT deployment has been approached as an engineering problem: what computers have to be part of which networks exchanging which types of data? This loses sight of the purpose of electronic medical records: helping doctors to provide better care to their patients and transforming the system at a macro scale so that it enables data-driven, evidence-based research on how to provide effective, cost-efficient care. Today, because most doctors are paid based on how many procedures they perform, as opposed to how good they are at keeping patients healthy, they will actually lose money if new information systems help them to deliver care more efficiently and keep people healthier. So, the key challenge for electronic medical record deployment is to marry up overall changes in healthcare policy with the right innovation environment to produce the health information infrastructure we need to support safer, more efficient health care.

A quick infusion of stimulus spending, combined with a long-term commitment to spend much of this money in a way that rewards doctors for delivering better care and the data needed to measure effectiveness and efficiency (as opposed to just subsidizing them to put expensive hardware and software on their desks), can help lay the groundwork for the systems needed for health care reform. As Lohr explains:

The time-tested way for governments to create jobs in a hurry is to pour money into old-fashioned public works projects like roads and bridges. President Obama’s economic recovery plan will do that, but it also has some ambitious 21st century twists.

The $825 billion stimulus plan presented this month by House Democrats called for $37 billion in spending in three high-tech areas: $20 billion to computerize medical records, $11 billion to create smarter electrical grids and $6 billion to expand high-speed Internet access in rural and underserved communities.
[..]
The technology industry is not typically viewed as a prolific job producer. Much of its manufacturing is highly automated. But bringing technology to services fields like health care, telecommunications and energy can be labor intensive and thus generate jobs.

The issues surrounding electronic health records illustrate the policy challenges of targeted programs. Mr. Obama has advocated spending $50 billion over five years to accelerate the use of such records and the sharing of health information across a national network.
[..]
The computerized records, when used properly, are an indispensable tool for measuring, tracking and improving patient care — yet only about 17 percent of the nation’s doctors are using them. They are commonplace at large medical groups, but 75 percent of doctors practice in small offices of 10 physicians or fewer.

Doctors often benefit from inefficiency, because the dominant fee-for-service payment system means they are paid for doing more — more doctor visits, tests, surgical procedures, pills.

“Paying to put computer hardware and software in physicians’ offices isn’t going to do anything unless you change the incentives in the system,” said Dr. David J. Brailer, former national health information technology coordinator in the Bush administration.
[..]
“You want to pay for achievement — better health quality and efficiency,” said Dr. David Blumenthal, director of the Institute for Health Policy at the Harvard Medical School, who advised the Obama campaign. “But in the transition period, before financial incentives are reformed, you need to provide incentives or grants to use electronic health records because this technology is sort of the opening wedge to reform.”

And he summarizes the current contents of the Health IT stimulus proposals developed by the transition team and currently being considered by Congress:

Those eligible for grants to buy technology, a member of the Obama transition team said, will include inner-city and rural hospitals and small doctor practices. But most money, he said, will go to incentive payments to improve quality and safety of care.

The big leverage that the Federal Government has is the over $700 billion that it spends on Medicare and Medicaid each year. All together, the Federal government pays for over 40% of all healthcare in the US, so directing that spending in a way that encourages a more data-driven health care system is the key to success. The stimulus spending will be the first step toward creating a system in which that money can be used to encourage smart, data-driven health care.

Transitioned

Submitted by Danny Weitzner on Fri, 2009-01-23 16:02.

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

I’ve spent the last eleven weeks working on the Obama-Biden Transition Project with the Technology Innovation and Government Reform (a.k.a. TIGR) policy group and have now finished. It’s been a great experience and a tremendous honor to be able to work on a wide range of technology policy issues with such a talented, disciplined and dedicated group of people. Knowing that President Obama is now in the White House and getting to participate in the inaugural festivities was a great way to cap this all off.

What was extraordinary about the Technology and Innovation policy group was that it existed at all. This presidential transition, like others in the past, had to do a thorough review of issues and challenges in all of the Federal agencies, select senior personnel to fill Cabinet and White House positions, and prepare strategies for meeting key policy challenges and campaign commitments: health care reform, economic recovery, national security, foreign policy, etc. While Internet technology and innovation issues have been on the radar screen for nearly 20 years, this was the first Presidential campaign and the first Presidential Transition Team to give tech policy issues high profile attention.

Now I’m going to take a few days off, try to catch up on old email, and look forward to returning to my research and teaching at MIT at the beginning of February.

OpenID "Hello World" on apache still deep magic

Submitted by connolly on Thu, 2009-01-08 18:37.

I have a home movie that I want to show to just a few friends around the Web. With OpenID, I should be able to just give my web server a list of my friends' pages, right?

I eventually found a README for mpopenid with just what I wanted:

PythonOption authorized-users "http://alice.com/ http://bob.com/"

But that wasn't on the top page of hits in a search for "apache OpenID". (Like most sites, mine runs on apache.) The top hit is mod_auth_openid, but its FAQ says my use case isn't directly supported:

Is it possible to limit login to some users, like htaccess/htpasswd does?
No. ... If you want to restrict to specific users that span multiple identity providers, then OpenID probably isn't the authentication method you want. Note that you can always do whatever vetting you want using the REMOTE_USER CGI environment variable after a user authenticates.

So I installed the prerequisites for mpopenid: libapache2-mod-python and python-elementtree were straightforward, but I struggled to find a version of python-openid that matched. I almost gave up at that point, but heartened by somebody else who got mpopenid working, I went back to searching and found a launchpad development version of mpopenid. That seems to work with python-openid-1.1.0.

In /etc/apache2/sites-available/mysite, I have this bit that glues mpopenid's login page into my site:

<Location "/openid-test-aux">
  SetHandler mod_python
  PythonOption action-path "/openid-test-aux"
  PythonHandler mpopenid::openid
</Location>

And in mysite/movies/.htaccess, this bit says only I get to see http://mysite.example/sekret:

<Files "sekret">
  PythonAccessHandler mpopenid::protect
  PythonOption authorized-users "http://www.w3.org/People/Connolly/"
</Files>

The mpopenid README also shows an option to put the list of pages in a separate file:

PythonOption authorized-users-list-url file:///my/directory/allowed-users.txt
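
Presumably (my guess; I haven't tried it) that file would hold the same whitespace-separated list of pages, e.g.:

http://alice.com/
http://bob.com/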

But I haven't tried that yet. So far I'm happy to put the list right in the .htaccess file.

President-Elect Obama's electronic medical records goal

Submitted by Danny Weitzner on Thu, 2009-01-08 14:14.

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

From Remarks of President-Elect Barack Obama
As Prepared for Delivery
American Recovery and Reinvestment
Thursday, January 8, 2009

[..]
“To improve the quality of our health care while lowering its cost, we will make the immediate investments necessary to ensure that within five years, all of America’s medical records are computerized. This will cut waste, eliminate red tape, and reduce the need to repeat expensive medical tests. But it just won’t save billions of dollars and thousands of jobs – it will save lives by reducing the deadly but preventable medical errors that pervade our health care system.
[..]

The paradox of information flow in transition

Submitted by Danny Weitzner on Tue, 2008-11-11 16:37.

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

A wonderfully perceptive and funny characterization from outgoing US Democratic National Committee Chair Howard Dean (Health care contenders - Chris Frates - Politico.com):

Dean said: “I’m not going to say anything about anything to do with transition. Generally, those who talk don’t know, and those who know don’t talk. And I don’t know what he’s [President-elect] going to do, but I ain’t talking.”

First legal shot across the Semantic Web's bow - Thomson suing Zotero

Submitted by Danny Weitzner on Mon, 2008-10-06 14:16.

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

Last week Thomson Reuters (the owner of EndNote Software, a widely used proprietary tool for collecting and managing scholarly bibliographic information) filed a lawsuit against Zotero, the most popular open source, Semantic Web-enabled bibliographic tool. Zotero, packaged as a Firefox extension, is a handy tool for collecting bibliographic metadata to assist scholars in managing information necessary for their research (news story, complaint). Zotero can import and export a variety of different bibliographic formats and does so in a web-friendly, RDF-enabled way. Exchanging and linking bibliographic information (i.e., the title, author, and publication venue) of scholarly communication is an important means to discover new links among individual pieces of research that are published around the world. This has been a high priority, for example, in the life sciences, where new knowledge can be uncovered by linking individual pieces of research together.

The latest beta release of Zotero will read and write EndNote’s proprietary metadata format and import and export the citation formats that EndNote provides for a wide variety of academic journals. In response to this, Thomson sued the Zotero developers (an open source community hosted at George Mason University), charging that Zotero (and GMU) reverse engineered the EndNote citation file format in violation of EndNote’s end user license agreement (EULA).

The key effect of Thomson’s suit, if it succeeds, would be to create a legal doctrine that enables software developers to restrict the Semantic Web’s potential to promote data interoperability and data integration. The legal issue at bar has to do with reverse engineering and the enforceability of EULAs, both of which are important questions. And there’s a lot to say about whether or not the complaint will stand up to legal scrutiny. That said, the Web community, as well as the scholarly community, ought to pay careful attention to this case because its outcome could have real bearing on how free we will all be in the future to exchange information and realize the knowledge-enhancing benefits of the Web through collaborative research.

Update: Nature Magazine editorializes about the threats to interoperability of the lawsuit.