Semantic Web 2009 - Research Papers
Reviews of submission #194: "Provenance of Inferences: A Trail of Proofs in Linked Open Data"

------------------------ Submission 194, Review 1 ------------------------

Reviewer: external

Overall Rating
  Must reject: It is important to reject this paper
Originality
  Poor
Technical Soundness
  Poor (some major issues)
Presentation
  Horrible: Needs shepherding (help)
Expertise
  2 (Passing Knowledge)

Summarize the Scientific Contribution
  The paper describes a software prototype that crawls linked open data, applies N3 rules to the data, and uses SPARQL Update to publish the entailments and "proof" data.

Summarize your review
  The paper should be rejected because it does not offer a novel scientific contribution, it is poorly organized, it contains many grammatical errors that significantly reduce readability, and it does not adequately address related work in the field.

The Review
  The paper offers a description of a prototype system that crawls linked open data and computes entailments based on N3 rules. The use of "real world" examples throughout the paper makes it accessible to the average reader, but the lack of formal technical content is disappointing.

  The paper describes a system, but does a poor job of explaining the motivation for that system. It systematically ignores how issues of data quality impact trustworthiness. E.g., in the first paragraph of Section 2 the authors argue that an attached triple has "importance based on the number of proofs supporting the same derived fact," but they make no mention of the veracity of the data that constitutes the proof.

  The paper is poorly organized. E.g., the motivation and use cases sections contain overlapping content, and neither contains the expected content. It is unclear why Section 2.3 "Desired Features" appears in the motivation section. Section 3.1 and Section 2.1 have considerable overlap.

  The paper does not address a great deal of related work, examples of which are included in the comments to the authors.

Comments to Authors
  The paper provides a brief system description but does a poor job of making clear what the novel contribution of this work is. If the paper is submitted elsewhere, you should make the contribution of this work explicit.

  The paper should be revised to be more careful with the use of terms like trustworthiness and proof. It conflates the issues of justification/proof and veracity, and should be more specific about which problem it is trying to address.

  The presentation of "linked data garbage collection" does not appear to be complete. The programming-language garbage collection analogy is not helpful. In particular, what in your delete scenario corresponds to an in-scope variable? I.e., if un-justified triples are deleted from the data web and you have not described a way to distinguish between ground facts and entailments, then either a cycle must exist or all data will be deleted (see the sketch below).

  This paper contains numerous grammatical errors that would have been identified by a careful reading by a native English speaker. These types of errors make reading the paper difficult and put a significant extra burden on the reviewer.

  The W3C DAWG activity standardizing SPARQL extensions, including update, should be referenced and its relevance discussed.

  Justification of logical entailment has a rich literature that is ignored by the paper. Consider, as a starting point, the ISWC 2008 best paper, "Laconic and Precise Justifications in OWL."
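  To make the garbage-collection concern concrete, here is a minimal sketch of such a sweep; the triple names, the proof structure, and the procedure below are invented for illustration and are not taken from the paper. Without a way to mark ground facts as roots, only cycles of mutual justification survive and everything else is deleted:

```python
# Hypothetical illustration (invented names, not the authors' algorithm):
# each triple maps to a list of proofs; each proof is a list of premise triples.

def sweep(store):
    """Repeatedly delete triples that have no proof whose premises are all still present."""
    store = dict(store)
    while True:
        dead = [t for t, proofs in store.items()
                if not any(all(p in store for p in proof) for proof in proofs)]
        if not dead:
            return store
        for t in dead:
            del store[t]

store = {
    "t_asserted": [],                # a ground fact: no recorded proof
    "t_derived":  [["t_asserted"]],  # entailed from the ground fact
    "t_cycle_a":  [["t_cycle_b"]],   # two triples that justify each other
    "t_cycle_b":  [["t_cycle_a"]],
}

print(sweep(store))
# Only t_cycle_a and t_cycle_b survive: the ground fact looks "un-justified"
# and is swept, taking its entailment with it.  Treating ground facts as
# never-collectable roots is exactly the distinction the paper needs to spell out.
```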
  Reference 5 appears to have an incorrect title: it does not match the link, and it instead matches reference 3.

------------------------ Submission 194, Review 2 ------------------------

Reviewer: external

Overall Rating
  Should reject: I argue for rejecting this paper.
Originality
  Poor
Technical Soundness
  Poor (some major issues)
Presentation
  Horrible: Needs shepherding (help)
Expertise
  3 (Knowledgeable)

Summarize the Scientific Contribution
  The main contribution of this paper is that it proposes to generate links from the linked data cloud and to preserve the provenance trails of those links, so that these computations become permanent and are at the same time kept consistent through a garbage collection process.

Summarize your review
  The paper tackles an interesting aspect of the maintenance of the linked data cloud. However, many of the aspects are dealt with in a naive way that does not adequately, or in a principled manner, consider the complexity of dealing with such a wealth of distributed knowledge.

The Review
  As described above, I consider the problem tackled by the authors to be very relevant for the linked data community, in the sense that it may be an important alternative for making the links that can be generated across distributed RDF sources / SPARQL endpoints, by means of rules or any other method, permanent, so that they can later be exploited more easily, while still maintaining the consistency of the obtained results when the original data change.

  However, my main concern is that the authors refer only briefly to the large body of work that was done two decades ago on truth maintenance systems; this work and the lessons learned from it should be reviewed more carefully. For instance, the garbage-collection-based proposal presented in Section 2.1 seems too naive to solve many of the problems that may appear, especially when taking into account that the update of distributed sources may be done in parallel by different data owners.

  This is not the only case where the proposals seem rather simple and do not go into the necessary details. For instance, the statement that just by using rdf:label the data will be multilingual is quite simplistic and hides much of the complexity that may be involved in the creation of real multilingual links among data sources.

  Besides, there is no clear description of how rules are executed in this environment. The authors rather concentrate on issues related to content negotiation and the management of the HTTP protocol in different situations, instead of describing how a distributed rule engine may work. It seems that the work will be done in a centralised way instead. I am not against this (in the current Web there are examples of systems that work in a centralised fashion and perform very well, e.g., Google), but there should be a clear description of how the rules are encoded, how they are executed across sources, etc.

  Finally, there is a lack of a proper evaluation of the proposed solution, focusing on either theoretical advantages and disadvantages or experimental ones.

Comments to Authors

------------------------ Submission 194, Review 3 ------------------------

Reviewer: external

Overall Rating
  I do not care what happens to this paper
Originality
  Fair
Technical Soundness
  Excellent (flawless)
Presentation
  Passable: Comparable to the usual
Expertise
  4 (Expert)

Summarize the Scientific Contribution
  This is mostly a vision statement and a draft of how it could be implemented: we should be able to reason on the Semantic Web, e.g.,
  by collecting facts, applying rules, and writing the results of this reasoning back to our server for later reuse.

Summarize your review
  I give great credit to the authors for wanting to provide a vision statement, and I would encourage them to in fact show this as a demo, argue it, and engage in discussions. I do not believe this vision paper is polished and convincing enough to make it to the main track at this point (it is more workshop material).

The Review
  What is described in the paper can be, and in fact has been, implemented. The way this is proposed does make technical sense. My problem with this paper is that the idea of having an "ever-running crawler" and a completely distributed system is, in the light of what the Internet has turned into over the last many years, unrealistically romantic.

  I would love to say "but the paper provides compelling examples and well-worked-out use cases involving real users, with empirical evidence that an ever-crawling daemon is in fact sustainable and provides benefits", but this is not the case. Also, if an original crawler was able to obtain those proofs using deterministic algorithms, somebody else could do the same if really interested at some later point, so the mechanism becomes a caching scheme; is this the most efficient way to do it? To be discussed.

  In the absence of a discussion that compares with other models, I am left with my original impression: it won't scale, it would be possible to do all this in a much more industrial, centralized way (while still preserving distributed data origins), and its advantages are feeble. This leaves the paper as a good vision statement, something to discuss, etc. (e.g., at an appropriate workshop), but I do not think it is "complete" and convincing enough for the ISWC main track. It really depends on the other submissions as well, though, so I will not argue actively for rejection.

Comments to Authors
  The ideas are obviously interesting and should be debated in some venue.