Provenance-Embedding Documents

Abstract

As computer and network technology improves, information will become more fluid, and the decentralized sharing and recombining of data will become even more prevalent than it is today. Knowing the provenance of data, and the policies associated with it, will be an important part of any attempt to regulate such an environment. Models that rely on centralized servers for this information are too rigid for use on the Internet, and models that rely on separate metadata files require an extra transfer and access each time the metadata is needed. In an environment such as the Internet, inline metadata will become the most convenient way to express provenance information. To make it easier for end users to create and use documents with known provenance and policy data, I present a method of embedding provenance metadata within documents using RDFa. I present a JavaScript API for extracting this information, so that utilities that use the metadata to compute interesting properties can be written easily. I present tools that allow end users to create provenance-embedded documents easily. I present an annotated MediaWiki that produces provenance-embedded documents from its internal database. Finally, I present numerous tools that allow end users to visualize provenance information in unique ways.

Thesis Outline

  1. Introduction
  2. Background
    1. Technologies
    2. Policy motivations
  3. Provenance-Embedded Documents
    1. Data structure embedded with RDFa
    2. JavaScript APIs for using PED structures
    3. Cryptography?
  4. Document Creation Tools
    1. Copy-paste bookmarklets
      1. Maybe a copy-paste Firefox extension?
    2. Notification of what licenses user can select
    3. Validator that checks whether a composite document satisfies the policies of its sources
      1. Check for literal satisfaction of CC-BY and CC-NC from user-generated rulesets?
    4. MediaWiki
      1. Generates RDFa embedded pages
      2. Has extended wiki markup to ease creation of PEDs
  5. Document Viewing Tools
    1. Provenance Browser
  6. Proof of Concept: Creative Commons Rights Engine
    1. Reasoner design and implementation
    2. Application Scenarios: TAMI Scenario 10
  7. Conclusion
Italics are topics that would be nice to cover, but not essential.

Projected Schedule

3/8 Implement copy-paste bookmarklet, implement JavaScript API
3/15 Write policy background, MediaWiki hacking
3/22 Write up ontology, write up bookmarklets, MediaWiki hacking
3/29 Write up MediaWiki
4/5 Code "provenance browser"
4/12 Code "provenance browser"
4/19 Debugging week + Write up provenance browser
4/26 Debugging week + Writing week
5/3 Writing week
5/10 Writing week
5/17 FULL DRAFT COMPLETED
5/17-5/28 Editing
MAY 28 THESIS DUE

Components

Provenance Embedding Documents
Status: Draft 1 done
Description This is the spec for the Provenance-Embedding Document. Sandro, Danny, and I worked out an initial spec. It is enough to get plenty of work started, but several issues remain unresolved, among them which namespace to use, how best to integrate with PML, and which ontology to use for allowed purposes.

JavaScript API
Status: Prototyped
Description

Copy-Paste Functionality
Status: Prototyped
Description I wrote a Firefox extension to provide provenance-preserving copy.
After discussion with Sandro, we decided to also create a JavaScript bookmarklet for better cross-platform compatibility.
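
As a rough illustration of what provenance-preserving copy could produce, the sketch below wraps a copied text fragment in an RDFa-annotated span that points back to its source page. It is not the extension or bookmarklet code. The function names, the fragment-id scheme, and the use of the Dublin Core `dc:source` property are all assumptions made for this example, since the namespace for the actual spec is still undecided.

```javascript
// Hypothetical sketch, not the actual bookmarklet.
// Assumption: provenance is recorded with Dublin Core's dc:source property.
const DC_NS = "http://purl.org/dc/terms/";

// Escape characters that are unsafe in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap copied text in a span whose RDFa attributes state:
//   <#fragmentId> dc:source <sourceUrl>
function wrapWithProvenance(text, sourceUrl, fragmentId) {
  const id = escapeHtml(fragmentId);
  return (
    `<span id="${id}" about="#${id}" ` +
    `xmlns:dc="${DC_NS}" rel="dc:source" resource="${escapeHtml(sourceUrl)}">` +
    `${escapeHtml(text)}</span>`
  );
}
```

In a browser, a bookmarklet along these lines might pass `window.getSelection().toString()` as the text and `location.href` as the source URL, then place the resulting markup on the clipboard or in an editing field.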

License Notification
Status: CC Version Implemented
Description

MediaWiki
Status: NONE
Description

Provenance Browser
Status: NONE
Description

Creative Commons Scenarios
Status: Draft 1 done
Description The Creative Commons scenarios are a use case for the provenance ontology. I will write a simple reasoning engine that computes the result of combining various Creative Commons licenses. I will then combine this engine with the JavaScript API to create tools that let users examine Creative Commons metadata. Further, I will create use cases for these end-user tools in the form of a TAMI scenario that revolves around a professor composing a presentation with slides from various sources.
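
To give a feel for what combining licenses might involve, here is a minimal sketch. It is not the planned reasoning engine. It assumes a deliberately simplified model in which each license is a set of Creative Commons elements (BY, NC, ND, SA), and the function names are hypothetical. The rules encoded are: a NoDerivatives source cannot be combined into a new work; the combined work inherits every element required by any source; and a ShareAlike source only permits a result under its own license.

```javascript
// Hypothetical sketch of a license combiner under a simplified CC model.
const ELEMENT_ORDER = ["by", "nc", "nd", "sa"];

// "CC-BY-NC-SA" -> Set {"by", "nc", "sa"}
function parseLicense(name) {
  const parts = name.toLowerCase().split("-");
  return new Set(parts.filter((p) => ELEMENT_ORDER.includes(p)));
}

// Set {"by", "nc"} -> "CC-BY-NC"
function formatLicense(elements) {
  const parts = ELEMENT_ORDER.filter((e) => elements.has(e));
  return ["CC", ...parts].join("-").toUpperCase();
}

// Returns the least restrictive license the combined work may carry,
// or null if the sources cannot be combined under this model.
function combineLicenses(names) {
  if (names.length === 0) return null;
  const sources = names.map(parseLicense);

  // A NoDerivatives source may not be adapted into a composite work.
  if (sources.some((s) => s.has("nd"))) return null;

  // The result must carry every element that any source requires.
  const required = new Set();
  for (const s of sources) {
    for (const e of s) required.add(e);
  }

  // A ShareAlike source demands that the result use exactly its license.
  // Each source is a subset of `required`, so equal size means equal sets.
  for (const s of sources) {
    if (s.has("sa") && s.size !== required.size) return null;
  }
  return formatLicense(required);
}
```

Under this model, slides licensed CC-BY and CC-BY-NC combine into a CC-BY-NC presentation, while CC-BY-SA and CC-BY-NC slides cannot be combined because the ShareAlike source does not allow the added NonCommercial restriction.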

Scenario 10, part 1
Scenario 10, part 2
Scenario 10, part 3

Creative Commons bookmarklet

Harvey C Jones