Provenance Retrieval (What does my data depend on?)
Data Validation (Is the data sound?)
Metadata Retrieval (How did my data get produced?)
Problem: We Can't Hold Distributed Data Accountable
How can Brian hold his academic sources accountable?
Problem: We Might Not Be Allowed to Know Everything
How can Gina validate the projections made by Consolidated Power and Light, Inc.?
Problem: We Can't Ask Questions about Prior Manipulations of Data
How can Ricardo identify the data that depended on a call to BadFunc()?
How can we solve these problems?
Localize the provenance and data!
Distributed
Localizing implies a larger global system that no single node holds.
Each node may only see a small portion of everything known.
Retrieval System
The problem is retrieving data to work on it.
Use "social networking" to solve the larger problem.
With Provenance
Difficult to carry around the entire provenance tree.
We don't want everything to be in the tree. (Zero-Knowledge Proofs)
Solution:
Don't carry the provenance along with you.
Let the network do the storage for you.
Propagation
Data propagation model used as the mechanism for sharing data.
Building framework for generic propagator programming.
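A minimal sketch of what propagator-style sharing might look like, assuming cells that hold a value plus the set of source names it depends on. The names Cell, add_content, and propagator are hypothetical and are not taken from the framework being built. A real propagator system would merge partial information rather than overwrite it; this sketch simply overwrites.

```python
from typing import Callable, List, Set


class Cell:
    """Hypothetical cell: holds a value and the names of the sources it depends on."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = None
        self.support: Set[str] = set()
        self.listeners: List[Callable[[], None]] = []

    def add_content(self, value, support: Set[str]) -> None:
        # Ignore updates that carry nothing new, so propagation terminates.
        if self.value == value and self.support == set(support):
            return
        self.value = value
        self.support = set(support)
        for listener in list(self.listeners):
            listener()


def propagator(fn: Callable, inputs: List[Cell], output: Cell) -> None:
    """Re-run fn whenever an input cell changes; the output inherits all input supports."""

    def run() -> None:
        if any(cell.value is None for cell in inputs):
            return  # wait until every input has content
        support = set().union(*(cell.support for cell in inputs))
        output.add_content(fn(*(cell.value for cell in inputs)), support)

    for cell in inputs:
        cell.listeners.append(run)
    run()


a, b, total = Cell("a"), Cell("b"), Cell("total")
propagator(lambda x, y: x + y, [a, b], total)
a.add_content(2, {"a"})
b.add_content(3, {"b"})
print(total.value, sorted(total.support))
```

The point of the sketch is that provenance travels with the data as it propagates: the output cell records which sources it depends on without any node having to store the whole history.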
Brian Can Ignore Outdated Work
He looks at the provenance tree to find which papers have an outdated ancestor.
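Brian's check can be sketched as an ancestor search over a provenance graph. The paper names and the PARENTS mapping below are invented for illustration, and the sketch assumes the whole graph is available in one place, which the distributed setting would not guarantee.

```python
from typing import Dict, List, Set

# Hypothetical provenance graph: each paper maps to the sources it directly depends on.
PARENTS: Dict[str, List[str]] = {
    "paper_A": ["paper_B", "paper_C"],
    "paper_B": ["paper_D"],
    "paper_C": [],
    "paper_D": [],
    "paper_E": ["paper_C"],
}


def ancestors(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every direct or indirect source of node."""
    seen: Set[str] = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def tainted_by(outdated: Set[str], parents: Dict[str, List[str]]) -> List[str]:
    """Return the papers that have at least one outdated ancestor."""
    return sorted(p for p in parents if ancestors(p, parents) & outdated)


print(tainted_by({"paper_D"}, PARENTS))  # ['paper_A', 'paper_B']
```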
There's Room for Black Boxes
Gina can trace the provenance tree and verify the operations used to generate Consolidated Power and Light, Inc.'s financial figures, even though she doesn't have access to the entire tree.
Ricardo Can Find Bad Data AND Let Others Know About BadFunc()!
He can ask the producers of his data whether they use BadFunc(). He can then remove any data that depends on BadFunc() and let those producers know they need to fix their programs.
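One way Ricardo's query could work is sketched below, assuming each node knows only its own operations and its direct producers, and answers a yes/no question by asking upstream. The Node class, the uses and audit functions, and the node names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical network node holding only its local provenance."""
    name: str
    operations: Set[str]                      # functions applied locally
    producers: List["Node"] = field(default_factory=list)
    notices: List[str] = field(default_factory=list)

    def uses(self, func: str, seen: Optional[Set[str]] = None) -> bool:
        """Answer yes or no without revealing anything else about upstream work."""
        seen = set() if seen is None else seen
        if self.name in seen:
            return False
        seen.add(self.name)
        if func in self.operations:
            return True
        return any(p.uses(func, seen) for p in self.producers)


def audit(consumer: Node, func: str) -> List[str]:
    """Ask each direct producer about func; notify and report those that depend on it."""
    affected = []
    for producer in consumer.producers:
        if producer.uses(func):
            producer.notices.append(f"{consumer.name}: data depends on {func}, please fix")
            affected.append(producer.name)
    return affected


lab = Node("lab", {"BadFunc()"})
archive = Node("archive", {"Merge()"})
aggregator = Node("aggregator", {"Sum()"}, producers=[lab])
ricardo = Node("ricardo", set(), producers=[aggregator, archive])

print(audit(ricardo, "BadFunc()"))  # ['aggregator']
```

Each producer replies with a single boolean, so Ricardo learns which inputs to discard without seeing how the upstream data was built.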
Distributed Provenance Summary
Natural model of information flow
Scales to larger systems
Handles zero-knowledge proofs without needing to distribute them