The Case for Distributed Provenance Propagation

Ian Jacobi

5 November 2009

Outline

Distributed Provenance is the Key to Scalability

We are here.

From a home office (photo: Bibi / CC BY-NC-SA 2.0) to a computer lab (photo: shinyai / CC BY-NC 2.0) to a data center (photo: Mathieu Ramage / CC BY 2.0)

Distributed Provenance Use Cases

Problem: We Can't Hold Distributed Data Accountable

CERN just made one of Gell-Mann's papers obsolete.  How can Brian find the papers that depend on this information?

Problem: We Might Not Be Allowed to Know Everything

If the power company can't reveal the underlying bills for privacy reasons, how can Gina be convinced that the projected earnings are correct?

Problem: We Can't Ask Questions about Prior Manipulations of Data

Ricardo learns that a library call, badFunc(), is known to give the wrong result when passed certain kinds of data, but he doesn't know whether his data sources used it or which data, if any, needs to be recalculated.  How can Ricardo find out what data is affected and which sources need to be told?

How can we solve these problems?

Localize the provenance and data!

Distributed

Brian's paper genealogy doesn't have to know everything about the papers...

Retrieval System

Ricardo doesn't know how Number Muncher and Number Cruncher generated any of his data, but if he asked the people who maintain them...

With Provenance

Gina can query Consolidated Power and Light, Inc. for where the projections came from.

Propagation

Each datum acts as a cell to which remote cells can subscribe.  Each store also acts as a cell that lists the cells it knows about.  By subscribing to a datum or a store, we are notified whenever its data changes.
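
The cell-and-subscription idea above can be sketched in a few lines of Python. This is only an illustrative, in-process model: the class names `Cell` and `Store`, the method names, and the example identifiers are assumptions made for this sketch, and a real deployment would presumably deliver notifications over the network rather than through local callbacks.

```python
from typing import Any, Callable, Dict, List


class Cell:
    """A datum that other parties can subscribe to (hypothetical sketch)."""

    def __init__(self, name: str, value: Any = None) -> None:
        self.name = name
        self.value = value
        self._subscribers: List[Callable[["Cell"], None]] = []

    def subscribe(self, callback: Callable[["Cell"], None]) -> None:
        """Register a callback to run whenever this cell's value changes."""
        self._subscribers.append(callback)

    def update(self, value: Any) -> None:
        """Store a new value and notify subscribers, but only on a real change."""
        if value == self.value:
            return
        self.value = value
        for callback in list(self._subscribers):
            callback(self)


class Store(Cell):
    """A store is itself a cell; its value is the sorted list of cell names it knows."""

    def __init__(self, name: str) -> None:
        super().__init__(name, value=[])
        self._cells: Dict[str, Cell] = {}

    def add_cell(self, cell: Cell) -> None:
        """Record a cell and publish the updated list of known names."""
        self._cells[cell.name] = cell
        self.update(sorted(self._cells))

    def get(self, name: str) -> Cell:
        return self._cells[name]


# Usage: subscribe to the store, then to one datum inside it.
notifications = []
store = Store("physics-archive")  # hypothetical store name
store.subscribe(lambda c: notifications.append(("store", list(c.value))))

paper = Cell("gell-mann-paper", "current")  # hypothetical datum
store.add_cell(paper)
store.get("gell-mann-paper").subscribe(
    lambda c: notifications.append((c.name, c.value))
)

paper.update("obsolete")  # the subscriber learns that the datum changed
print(notifications)
```

Treating the store as just another cell means a subscriber can learn about newly published data with the same mechanism it uses to learn about changes to existing data.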

Brian Can Ignore Outdated Work

Brian can reconstruct the provenance tree by looking at where he got the papers from.
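
As a rough illustration, the reconstruction can be modeled as a recursive walk over "where did this come from" records. The paper identifiers and the `SOURCES` table below are invented for the sketch; in a distributed setting each lookup would presumably be a query to whichever store holds that paper, not a local dictionary access.

```python
from typing import Dict, List, Optional, Set

# Hypothetical records: each paper lists the papers it was derived from.
SOURCES: Dict[str, List[str]] = {
    "brian-survey": ["paper-a", "paper-b"],
    "paper-a": ["gell-mann-paper"],
    "paper-b": ["paper-c"],
    "paper-c": [],
    "gell-mann-paper": [],
}


def provenance_tree(paper: str, sources: Dict[str, List[str]]) -> dict:
    """Return the sources of `paper` as a nested dict (assumes no cycles)."""
    return {src: provenance_tree(src, sources) for src in sources.get(paper, [])}


def depends_on(
    paper: str,
    target: str,
    sources: Dict[str, List[str]],
    seen: Optional[Set[str]] = None,
) -> bool:
    """True if `target` appears anywhere among the transitive sources of `paper`."""
    seen = set() if seen is None else seen
    for src in sources.get(paper, []):
        if src == target:
            return True
        if src not in seen:
            seen.add(src)
            if depends_on(src, target, sources, seen):
                return True
    return False


tree = provenance_tree("brian-survey", SOURCES)
outdated = [p for p in SOURCES if depends_on(p, "gell-mann-paper", SOURCES)]
print(tree)
print(outdated)
```

Here `outdated` lists every paper whose provenance leads back to the obsolete one, which is the set Brian would want to ignore or revisit.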

There's Room for Black Boxes

Consolidated Power and Light can rely on zero-knowledge proofs, for example, to prove parts of the projections.

Ricardo Can Find Bad Data AND Let Others Know About badFunc()!

Ricardo can find out which data depends on Number Muncher and which depends on Number Cruncher, ask their maintainers whether they depend on badFunc(), and ask them to fix their programs!
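
A minimal sketch of that exchange follows, assuming each maintainer can answer a yes/no question about its own provenance and accept a bug report. The `Maintainer` class, its methods, and the dataset names are hypothetical stand-ins for whatever remote interface the services would actually expose.

```python
from typing import Dict, List, Set, Tuple


class Maintainer:
    """Hypothetical stand-in for a remote service that keeps its own provenance."""

    def __init__(self, name: str, functions_used: Set[str]) -> None:
        self.name = name
        self._functions_used = functions_used
        self.bug_reports: List[str] = []

    def uses(self, function_name: str) -> bool:
        """Answer from local provenance only; nothing else is revealed."""
        return function_name in self._functions_used

    def report_bug(self, function_name: str) -> None:
        """Record a request to fix or replace the named function."""
        self.bug_reports.append(function_name)


def find_affected(
    data_sources: Dict[str, str],
    maintainers: Dict[str, Maintainer],
    function_name: str,
) -> Tuple[List[str], List[str]]:
    """Return (data to recalculate, sources that were told about the bug)."""
    affected: List[str] = []
    notified: List[str] = []
    for datum, source in data_sources.items():
        maintainer = maintainers[source]
        if maintainer.uses(function_name):
            affected.append(datum)
            if source not in notified:  # tell each source only once
                maintainer.report_bug(function_name)
                notified.append(source)
    return affected, notified


# Hypothetical setup: which service produced each of Ricardo's datasets.
maintainers = {
    "Number Muncher": Maintainer("Number Muncher", {"badFunc", "goodFunc"}),
    "Number Cruncher": Maintainer("Number Cruncher", {"goodFunc"}),
}
ricardo_data = {
    "q1-totals": "Number Muncher",
    "q2-totals": "Number Cruncher",
    "q3-totals": "Number Muncher",
}

affected, notified = find_affected(ricardo_data, maintainers, "badFunc")
print(affected)
print(notified)
```

Ricardo never sees how either service computes its results; he only needs to know which service produced each datum and to be able to ask that service a question.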

Distributed Provenance Summary

Redistribution License

Creative Commons License