Hal Abelson (MIT), Adam Barth (Stanford), Tim Berners-Lee (MIT), Joan Feigenbaum (Yale), Chris Hanson (MIT), Jim Hendler (Maryland), Aaron Johnson (Yale), Lalana Kagal (MIT), Carlos Delgado Kloos (Universidad Carlos III de Madrid (visiting MIT)), Ora Lassila (Nokia), Harvey Jones (MIT), Eddan Katz (Yale), Michael J. May (Penn), Deborah McGuinness (Stanford), Helen Nissenbaum (NYU), Richard Murphy (GSA), Joe Pato (HP), Calvin Powers (IBM), Felipe Saint-Jean (Yale), Vitaly Shmatikov (Texas), Avi Silberschatz (Yale), Gerry Sussman (MIT), Poorvi Vora (George Washington), Daniel Weitzner (MIT), Rebecca Wright (Stevens)
Twenty-five researchers met for two days to discuss a range of issues relating to privacy in online environments and means of promoting accountability to rules and policies. From these discussions, three basic research challenges emerged:
We summarize the questions here and then provide summaries of the talks presented by workshop participants, as well as reports from the breakout discussions that followed the individual presentations.
Increasing use of computers and networks in business, government, recreation, and almost all aspects of daily life has led to a proliferation of "sensitive data," i.e., electronic data records that, if used improperly, can harm the data subjects, data owners or data users or contravene other public-policy interests. As a result, concern about the ownership, control, privacy, and accuracy of these data has become a top priority. Traditional security and privacy research (including but not limited to crypto-theory research) attempts to alleviate this concern by inhibiting the transmission of sensitive data. Encryption, access control, and privacy-preserving, distributed function evaluation are well studied techniques that exemplify this approach. However, there are important reasons to believe that this approach is inadequate. We briefly explain two of them here.
First, many sensitive data items are simply "facts" about people and organizations that are (1) legitimately used by a large number of organizations and (2) fairly easily acquired by someone determined to do so. Although it may be reasonable to expect that an x-ray of one's shoulder remain in the records system of the radiology lab that created it and be seen only by medical-service providers who need to see it, it is not reasonable to expect that one's name and address will remain confined to a small number of isolated systems. Even if 99% of the organizations that acquire this succinctly representable, sensitive information use it appropriately, the negligent 1% could cause a deluge of unwanted communication; furthermore, a determined adversary can acquire this information with a modest amount of targeted effort.
Second, there is a general argument in favor of controlling use of sensitive information instead of controlling its dissemination. Determining whether a particular use is appropriate may require a great deal of information that is only available in context, i.e., at the time and place of the proposed use. While there are many differences between privacy and copyright laws, copyright provides a good example of the advantage of protecting use rather than access or collection. Draconian DRM systems that prevent access to such works can prohibit fair use, which is legal under copyright law, in an attempt to inhibit infringement, which is illegal. At the same time, determining whether a proposed use is allowed by the fair-use doctrine requires a great deal of contextual knowledge. Reasoning by analogy with the End-to-End Arguments of networking, one is led to the general pattern of handling sensitive data in a manner that makes it available to all who may have a legitimate use but at the same time requires that potential users prove that they have a right to use it before doing so, or at least be held accountable for how they have used it.
Today's Internet architecture does not provide adequate support for accountability, for two reasons. First, it is based on network addresses, not names. The binding of names to addresses is neither secure nor verifiable, and the same is true of the binding of transmitted data objects to addresses. Consequently, high-level authentication, authorization, and accountability mechanisms that are based on names (as they inevitably will be, because application semantics are expressed in terms of names, not network addresses) can be completely subverted by network-level attacks: denial of service, IP spoofing, DNS spoofing, etc. Could a different network architecture better support an "accountability infrastructure"? How can one create network resources with universally understood, secure, persistent, verifiable names? What are the minimal sets of network resources upon which one can build proof systems for authorized use of sensitive data in a broad range of applications? Are there other network-architectural approaches to enabling accountability? Second, enabling accountability to rules that describe permissible and impermissible uses of information requires identification of data and responsible parties at a different (higher) level of abstraction than is provided by the network layer. So even if we securely bind names to addresses, we must then bind names to authorized agents and classes of agents and describe the relationship between agents and classes of data.
Many of the accountability approaches discussed in this workshop will depend on the creation of secure logs representing information access and usage events, as well as the ability to bind policies (also known as "sticky policies") to information resources. While audit logs are nothing new in IT systems, there are special challenges in recording and verifying transactions in large-scale, decentralized systems. At the same time, there are aspects of this problem that have been worked on extensively, e.g., in the watermarking and DRM context, as well as sticky policies in database systems. As we may not be able to solve the verifiable-provenance problem in its full generality, it will be necessary to find tractable special cases of the problem.
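As a small, concrete illustration of one ingredient of such secure logs, the sketch below chains each access/usage event to its predecessor with SHA-256, so that altering any recorded event invalidates every later entry. The structure and field names are assumptions for illustration only; a deployed system would also need signatures and trusted timestamping.

```python
import hashlib
import json

def append_event(log, event):
    """Append an access/usage event, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"event": event, "prev": prev, "hash": digest})

def verify(log):
    """Recompute the chain; tampering with any past entry breaks all later hashes."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps(entry["event"], sort_keys=True)
        if entry["prev"] != prev:
            return False
        if hashlib.sha256((prev + body).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

Hash chaining alone only makes tampering evident to someone who holds a correct head-of-chain value; the decentralized-verification challenges noted above remain.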
People claim that they care about privacy, and yet they store a great deal of sensitive data on commercial websites such as GMail. The apparent gap between our larger concern about privacy and individual people's behavior raises a number of hard questions:
In the final analysis, how should researchers and system designers balance the desire to build systems with guaranteed levels of privacy against the more general but neutral requirement to make systems that can be responsive to privacy needs expressed by society through law, social practice, and individual behavior? We do not pretend to reach consensus on these questions but recognize them as important to consider in shaping research directions.
Summary: The PORTIA project focuses on "sensitive information." Why the term "sensitive" and not "private"? Because "privacy" is problematic as a term. To crypto people, it implies "secrecy." "Private" to crypto/security types is about confidential information, stuff that the world shouldn't be able to access at all. Today, the challenge isn't keeping information totally secret but rather assuring that information is used appropriately. PORTIA goals are better tools, thinking about data lifetime, and development of conceptual frameworks.
Or the outside perspective: How can you be sure that the organization you're interacting with is respecting its policy? If you discover that it is not, what recourse do you have? Appropriate use may be defined only implicitly. There is a recent, helpful article by Dan Solove, "A Taxonomy of Privacy," in which he argues that courts are loath to find privacy rights in non-secret information. Consider a company that gets information from you during a transaction. It may not be secret information, but if the company sells it to a data broker (e.g., ChoicePoint), that could be considered inappropriate. Implicit in your understanding of the rules governing this data exchange is that your information won't be used for purposes that are unrelated to the transaction or to the company's business.
[Barth slides] [Nissenbaum slides][Paper]
Additional discussion: Elements of Contextual Integrity (CI) are:
Abstract: There is an urgent need for transparency and accountability in modern data-mining applications. Attempts to address issues of personal privacy in a world of computerized databases and information networks -- from security technology to data-protection regulation to Fourth Amendment jurisprudence -- typically proceed from the perspective of preventing or controlling *access* to information. We argue that this perspective has become inadequate and obsolete, overtaken by the effortlessness of sharing and copying data, and the ease of aggregating and searching across multiple databases to reveal private information from public sources. To replace the access limitation framework, we propose that issues of privacy protection currently viewed in terms of data access be re-conceptualized in terms of data use. From a technology perspective, this requires supplementing legal and technical mechanisms for access control with new mechanisms for transparency and accountability of data use. We are seeking to design an information architecture for the Web that can provide transparent access to the reasoning steps taken in the course of data mining, and establish accountability for use of personal information as measured by compliance with rules governing data usage.
Most other privacy regulatory systems are coming under stress. The question for any framework that gives individuals control over how their information is used is how to deal with the explosion of choices that people have to make to protect their information. We have too many choices to make here. Given that a key problem with data mining is the unexpected use of information for purposes other than those for which it was collected, the lynchpin of privacy frameworks in the US and EU has to be purpose limitation. When information is collected, we tag it with its original purpose. The EU says that once it has been collected for that purpose, it can't be used for another. This is a good theory, but information networks are interesting precisely because they can cross-correlate information. When presented with a choice, the temptation for a user is to just say "Yes" to a broad range of uses, thus rendering the purpose limitation useless.
We need to be at the use end of the regulatory spectrum, concentrating on enforcement and limitations of those uses. Why? It will get us to focus on the values that we care about. Lots of privacy laws are surrogates for values, but don't always express the ones that we care about. Sometimes you can only really make a decision based on use. The motivations of the actors become obvious at the use end.
The Web is moving from being read only to read/write. Content production is shifting from centralized to decentralized. This is visible in shifting from copyright paradigms, from up front DRM-style protection, to post hoc rights description frameworks such as Creative Commons. Enforcement and accountability are shifting to the use end of the spectrum.
Summary: Creative Commons is about how information is used. One might think of this as DRM without any enforcement, but that is wrong: it's about notice, not DRM. Look at the source: a Firefox extension gives notice of what you can and can't do with a work. Notice, as opposed to restriction, still gives you power. People ask how this links with digital restrictions. It's really just about attaching policy to data, with enforcement happening after the fact.
Consider this example of data usage from TAMI: Joe travels on a plane, and the TSA gets his information. The TSA passes his information to the FBI because he is a possible match. The FBI discovers that there is a deadbeat-dad warrant out on him and arrests him. But this was against the rules, since the information is only allowed to be used for the purpose of terrorism protection. The system can tell that the arrest was unjustified, since the data was collected for a different purpose. A log of how the data has been used is maintained in Truth Maintenance System (TMS) format.
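The purpose-limitation check at the heart of this scenario is easy to state in code. The following sketch uses hypothetical field names and is not the actual TAMI/TMS implementation; it tags each datum with its collection purpose and logs every use, flagging uses whose stated purpose differs:

```python
audit_log = []

record = {"subject": "Joe",
          "data": "passenger itinerary",
          "collected_for": "terrorism-protection"}

def use(record, purpose, action):
    """Log the use; it is compliant only if its purpose matches collection."""
    compliant = (purpose == record["collected_for"])
    audit_log.append({"action": action, "purpose": purpose,
                      "compliant": compliant})
    return compliant

use(record, "terrorism-protection", "TSA screens passenger")   # allowed
use(record, "warrant-enforcement", "FBI arrests passenger")    # violation
```

An after-the-fact audit over `audit_log` reaches the scenario's conclusion: the arrest event is non-compliant because the data was collected for a different purpose.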
-------------- Lunch --------------
Summary: In the talk, we briefly present the approach taken in the TAMI project to formalizing the legal concepts and rules of some case studies in the Semantic Web rules language N3. We also offer some reflections on the role of abstraction in modelling, and conclude that the need to evolve ontologies over time requires appropriate theories, methods and tools to cope with this evolution. The concept of context might help with this from a practical point of view.
Summary: When expressing proofs about compliance in systems that encode legal relationships, we want people to believe answers given to them by systems. The Inference Web infrastructure gives access to sources and inferences used to reach conclusions. Proof Markup Language (PML) provides an interlingua for exchanging proofs. We see two approaches to privacy: tight binding at time of information transmission, or one more like Creative Commons. This relies on semi-honest participation. The working assumption on the web is for loose binding. If people are out to be deceptive, or obscure origin, we have a much harder problem.
Abstract: We describe the motivations for, and development of, a rule-based policy management system that can be deployed in the open and distributed milieu of the World Wide Web. We discuss the necessary features of such a system in creating a "Policy Aware" infrastructure for the Web, and argue for the necessity of such infrastructure. We then show how the integration of a Semantic Web rules language (N3) with a theorem prover designed for the Web (Cwm) makes it possible to use the Hypertext Transport Protocol (HTTP) to provide a scalable mechanism for the exchange of rules and, eventually, proofs, for access control on the Web.
Summary: This project looked closely at the Privacy Act of 1974 as part of a larger open source project (OSERA), whose goal is to provide an open source architecture for the Federal government. The GSA Policy Engine architecture proposes an interaction model for service providers and consumers that embodies the social contract of John Locke. The interaction model serves as a foundation for the collection, use, maintenance, and dissemination of private information based on defined policies and preferences. The US Privacy Act ontology provides for inferring allowed disclosures based on disclosure target, authorization, and intent. The privacy ontology also provides for inferencing on denied requests and disclosure accounting. The privacy ontology is available under the OSERA open source agreement.
Abstract: We present an information-theoretic model of a voting system, consisting of (a) definitions of the desirable qualities of integrity, privacy and verifiability, and (b) quantitative measures of how close a system is to being perfect with respect to each of the qualities. We describe the well-known trade-off between integrity and privacy in this model, and define a concept of weak privacy, which is traded off with system verifiability.
Summary: I am not here to present my own work. Cryptography is being reinvigorated by security-breach notification laws, e.g., California SB 1386. How do I add encryption to legacy systems while minimizing disruption? What is the best encryption I can do in a DB field without changing the size of the field? Key management is a large challenge and needs to become cheaper: it costs ~$8 to issue a public key to someone.
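The size-preserving question is the province of format-preserving encryption. The toy Feistel construction below, a sketch of the idea rather than a vetted scheme (round count, HMAC-based round function, and all names are illustrative), encrypts an n-digit decimal field to another n-digit value and decrypts it exactly:

```python
import hashlib
import hmac

def _round_value(key, i, v, mod):
    """Pseudorandom round function derived from HMAC-SHA256."""
    msg = bytes([i]) + v.to_bytes(8, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big") % mod

def fpe_encrypt(x, key, digits, rounds=8):
    """Encrypt an integer 0 <= x < 10**digits to another integer in that range."""
    right_mod = 10 ** (digits // 2)
    left_mod = 10 ** (digits - digits // 2)
    left, right = divmod(x, right_mod)
    for i in range(rounds):
        if i % 2 == 0:
            left = (left + _round_value(key, i, right, left_mod)) % left_mod
        else:
            right = (right + _round_value(key, i, left, right_mod)) % right_mod
    return left * right_mod + right

def fpe_decrypt(y, key, digits, rounds=8):
    """Invert fpe_encrypt by running the Feistel rounds backwards."""
    right_mod = 10 ** (digits // 2)
    left_mod = 10 ** (digits - digits // 2)
    left, right = divmod(y, right_mod)
    for i in reversed(range(rounds)):
        if i % 2 == 0:
            left = (left - _round_value(key, i, right, left_mod)) % left_mod
        else:
            right = (right - _round_value(key, i, left, right_mod)) % right_mod
    return left * right_mod + right
```

Because ciphertexts stay within the same number of digits, a numeric column in a legacy schema can be encrypted in place without widening the field.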
Sticky policy paradigm: Different laws cover different parts of a data repository. HIPAA applies to some columns; state laws often affect subsets of rows. There is a lot of overlap between policies. Today, policy doesn't travel with the data, which leads to privacy violations. Often it is not a case of maliciousness: the right hand didn't know what the left hand was doing. The aim is to track policy without spending many more bits.
- Do you presume policies are consistent? We're talking about keeping track of the requirements, not how to resolve conflicts. This ought to be doable at transaction time or at design time. If you model and check at design time, you don't have to modify object code (unless the policy depends on the content itself); this is a static-analysis problem.
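A minimal way to make policy "travel with the data" is to tag values and propagate the tags through every derivation. The sketch below uses invented names and is not IBM's actual mechanism; it simply gives any derived value the union of its inputs' policies:

```python
class Tagged:
    """A value bundled with the set of policies that govern it."""
    def __init__(self, value, policies):
        self.value = value
        self.policies = frozenset(policies)

def derive(f, *inputs):
    """Anything computed from tagged inputs inherits all of their policies."""
    result = f(*(t.value for t in inputs))
    merged = frozenset().union(*(t.policies for t in inputs))
    return Tagged(result, merged)

diagnosis = Tagged("asthma", {"HIPAA"})
address = Tagged("24 Oak St", {"CA-SB-1386"})
row = derive(lambda d, a: (d, a), diagnosis, address)
```

Here `row` carries both policy tags: joining a HIPAA-covered column with a state-law-covered row propagates both obligations, which is exactly the bookkeeping whose absence produces the "right hand / left hand" violations described above.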
Abstract: We investigate whether it is possible to encrypt a database and then give it away in such a form that users can still access it, but only in a restricted way. In contrast to conventional privacy mechanisms that aim to prevent any access to individual records, we aim to restrict the set of queries that can be feasibly evaluated on the encrypted database.
We start with a simple form of database obfuscation which makes database records indistinguishable from lookup functions. The only feasible operation on an obfuscated record is to look up some attribute Y by supplying the value of another attribute X that appears in the same record (i.e., someone who does not know X cannot feasibly retrieve Y). We then (i) generalize our construction to conjunctions of equality tests on any attributes of the database, and (ii) achieve a new property we call group privacy. This property ensures that it is easy to retrieve individual records or small subsets of records from the encrypted database by identifying them precisely, but "mass harvesting" queries matching a large number of records are computationally infeasible.
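The core trick behind such obfuscated records can be sketched very simply: derive an encryption key for attribute Y from the value of attribute X, so that Y is recoverable only by someone who already knows X. This toy version (SHA-256 keystream; all names are illustrative, and it omits the conjunction and group-privacy machinery of the full construction) shows the idea:

```python
import hashlib

def _keystream(key, n):
    """Expand a key into n pseudorandom bytes with SHA-256 in counter mode."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def obfuscate(x_value, y_value, salt):
    """Encrypt attribute Y under a key derived from attribute X's value."""
    key = hashlib.sha256(salt + x_value.encode()).digest()
    y = y_value.encode()
    return bytes(a ^ b for a, b in zip(y, _keystream(key, len(y))))

def lookup(ciphertext, x_value, salt):
    """Recover Y; yields the plaintext only when the supplied X value is correct."""
    key = hashlib.sha256(salt + x_value.encode()).digest()
    plain = bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, len(ciphertext))))
    return plain.decode("utf-8", errors="replace")
```

Mass harvesting now requires guessing X for every record, which is the lever that makes retrieving one precisely identified record cheap while dumping the whole database stays expensive.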
Discussion: What is the "airline problem"?
The "airline motivation" refers to a special case of the problem. It is not feasible for the typical customer of a commercial airline to keep secret which flights he takes: Someone who is determined to know where and when this person flies can obtain the information, if necessary, simply by watching him until he goes to an airport. This is not a reason to conclude that commercial airlines should simply turn over their passenger databases to anyone who has a legitimate reason to obtain the itinerary of one specific passenger. Obtaining one itinerary costs something, perhaps the small amount of time and effort needed to get a warrant and perhaps the large amount of time and effort needed for surveillance of the passenger in question. Is it feasible to make obtaining the entire database commensurately more expensive? Will this provide useful protection against improper uses of personal information?
Summary: The basic goal of our work is to allow data mining without bringing all the data together. You want to compute some function of the data, allowing that function and nothing else to be computed. Tools: crypto, perturbation, randomization. The real goal is to enable collaboration that otherwise could not happen because of rules against sharing the data, or unease about doing so. There is a tradeoff among inefficiency, inaccuracy, and privacy loss. Randomization moves in the inaccuracy/privacy plane; crypto can achieve complete privacy and accuracy, at a high efficiency cost. Secure multiparty computation is a general-purpose theory but inefficient. Part of the work goes into generating special-purpose protocols that can solve a given problem at lower cost.
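A small concrete instance of the crypto approach is secure summation by additive secret sharing: each party splits its input into random shares, distributes them, and only the total is revealed. This is a generic textbook sketch under a semi-honest assumption, not one of the special-purpose protocols mentioned above:

```python
import random

P = 2**61 - 1  # public prime modulus; all arithmetic is mod P

def share(secret, n):
    """Split a value into n random shares that sum to the secret mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def secure_sum(all_shares):
    """all_shares[p] holds party p's shares; party i publishes the sum of the
    i-th shares it received, and the partial sums reveal only the total."""
    n = len(all_shares)
    partials = [sum(all_shares[p][i] for p in range(n)) % P for i in range(n)]
    return sum(partials) % P
```

No single party's partial sum reveals any individual input, yet the result is exact: this sits at the "full accuracy, full privacy, higher cost" corner of the tradeoff described above.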
Privacy-Preserving clustering. Two party computation for k-center clustering.
Open Questions: Policies or enforcement on which functions are allowed to be computed. How do we characterize how much information computing a function of a function releases? Data cleaning. How can policy enforcement be proved? What happened to the data? How do we define a set of policies for a sequence of functions to be evaluated? In client-server mode, a client can query a server and learn only the data that the policy allows, while the server learns nothing about the policy or the data.
We need lighter-weight crypto protocols that are less expensive.
How do we provide public users with proof, or at least confidence, that all the crypto is doing the right thing and is compliant?
Abstract: There is a growing interest in establishing rules to regulate the privacy of citizens in the treatment of sensitive personal data such as medical and financial records. Such rules must be respected by software used in these sectors. The regulatory statements are somewhat informal and must be interpreted carefully in the software interface to private data. This paper describes techniques to formalize regulatory privacy rules and how to exploit this formalization to analyze the rules automatically. Our formalism, which we call privacy APIs, is an extension of access control matrix operations to include (1) operations for notification and logging and (2) constructs that ease the mapping between legal and formal language. We validate the expressive power of privacy APIs by encoding the 2000 and 2003 HIPAA consent rules in our system. This formalization is then encoded into Promela and we validate the usefulness of the formalism by using the SPIN model checker to verify properties that distinguish the two versions of HIPAA.
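The flavor of the formalism, access-control-matrix operations extended with logging and notification, can be suggested in a few lines of Python. The subjects, objects, and rules here are entirely hypothetical; the actual work encodes the HIPAA rules in Promela for the SPIN model checker:

```python
# (subject, object) -> set of rights; plus a log of attempts and a notification queue
matrix = {("doctor", "chart"): {"read", "append"}}
log = []
notifications = []

def access(subject, obj, right):
    """Classic access-matrix check, extended so that every attempt is logged."""
    allowed = right in matrix.get((subject, obj), set())
    log.append((subject, obj, right, allowed))
    return allowed

def grant(granter, subject, obj, right):
    """Grant a right and notify the data subject, as consent rules may require."""
    matrix.setdefault((subject, obj), set()).add(right)
    notifications.append((granter, subject, obj, right))
```

Properties such as "no disclosure occurs without a corresponding notification" then become checkable over the log, which is the kind of property a model checker can verify against two versions of a regulation.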
Provenance is a first class citizen - thus can be queried, can have provenance, can be part of the application logic
Provenance is one form of metadata
Bad guy - (PORTIA)
Insight: sometimes the audit trail is more sensitive than the data itself (since one could deduce more from the sources); metadata may therefore need stronger access control.
Legal: legal settings may require additional provenance information (such as extra geographic information for jurisdiction, what you are trying to prove, etc.)
(Partial access needs to be supported.)
Asides: Provenance has fuzzy boundaries. Possibly provenance is a special subset of metadata.
Researchers in the Portia community (Nissenbaum and Barth) have concentrated on describing and instrumenting systems according to flow/transmission rules. The TAMI project, by contrast, has worked to describe and reason over usage rules. In discussing the relationship between flow (transmission) rules and usage rules, we agreed:
1. Transmission rules can be expressed as usage rules without doing violence to the transmission rules.
2. Audit metadata is vital. Open questions:
3. We need more discussion of the spectrum of design/expressive options, from transmission-rule enforcement to adverse-action (ultimate-use) rules.
4. How will these systems adapt to changing rules/laws?
5. We should work to develop a design for end-to-end accountability. This can be a neutral principle for policy-aware systems.
Scenario 1 (Amazon)
Scenario 2 (Consumer)
Scenario 3 (Business)
Feigenbaum and Weitzner's work organizing the workshop and preparing this summary was funded by the NSF PORTIA and TAMI projects.