Transparency for behavioral profiling
The original appearance of this entry was in Danny Weitzner - Open Internet Policy
Behavioral targeting is pervasive on the Web. As documented by a very nicely-researched New York Time story today (’To Aim Ads, Web Is Keeping Closer Eye on You,’ NYT, by Louise Story, 10 March 2008.) it’s now clear that each of us who use popular search engines and portals are the subject of thousands of individual data collection events per month of Web usage.
I’m glad to see some clear analysis of the practice out there but would like to see an additional level of transparency. If it is the case that profiling is benign, then why not tell uses what aspect of their profile triggered the placement of a particular ad. The ad delivery systems all make decisions about which ads to place for a given user from some properties of that user that are either known or inferred. Why not just tell us what those properties are along with the add placement. This would go a long way toward eliminating the feeling that we’re being ’spied on’ because it would eliminate any sense of secrecy about what is learned in the course of the behavioral monitoring. My guess is that many people would ignore the profile data, but some would check it, and we’d all have piece of mind from knowing that whatever is being done is happening out in the open.
According to the Times, data is collected on which web pages we look at and is then combined with other data (demographics, browsing history, purchases on partner sites, etc.). Right on cue traditional privacy advocates declare that profiles developed in this way (based on our behavior) do (or should) make us feel uneasy:
“When you start to get into the details, it’s scarier than you might suspect,” said Marc Rotenberg, executive director of the Electronic Privacy Information Center, a privacy rights group. “We’re recording preferences, hopes, worries and fears.”
No doubt people (as least some people) feel alarmed about this and probably others are either implicitly or explicitly happy to have the right ads targeted to them. As an online ad agency exec said in the article:
“Everyone feels that if we can get more data, we could put ads in front of people who are interested in them,” he said. “That’s the whole idea here: put dog food ads in front of people who have dogs.”
Unless were going to require an outright ban on this sort of behavioral targeting, the question what to do about it. Is the goal to allay people’s fears? To limit the use of the profiles? Or to help people avoid incorrect targeting?
The statistics developed by comScore for the New York Times article do a nice job of illustrating the magnitude of data collection that happens. Jules Polonetsky, AOL’s Chief Privacy Officer, is launching a new consumer education campaign to explain the mechanics of data collection and tracking to users. The light that both the Times stories and the AOL campaign shed on marketing practices is valuable.
Many people are going to far more interested in how this profiling actually effects them, than on the overall magnitude of the practice. Is there any reason not to be upfront with people about the basis for delivering an ad? If there is, then there is reason to feel that we’re being deceived or maniplated, not assisted, by the behavior tracking techniques.
sidekick calendar subscription for SXSW
At a conference, like in a good coding session, it's too easy to lose track of time, so I rely heavily on a PDA to remind me of appointments. The SXSW program has just the features I want:
- an "add this to my calendar" button next to each session
- a calendar feed of my choices
But I carry a hiptop, which doesn't support calendar subscription. I could copy-and-paste a few critical sessions to my hiptop, but when the climbing geeks offer an hCalendar feed, it becomes wortwhile to use iCal on the laptop, i.e. something that groks calendar subscription, as the master calendar device.
I have had a system for exporting my mobile calendar as a feed, but it's a tedious 4 step shell command sequence; it's OK once or twice a week, but here at SXSW, I want to sync up several times a day.
I have been moving my palmagent project from shell commands and Makefiles to a RESTful Web service, and this pushed me over the edge to add calendar feed support.
As usual, to pull the data from the hiptop's data servers:
- Make a directory to hold hiptop accounts and put it in hip_config.py:
AccountsDir = "/Users/connolly/Desktop/danger-accts"
- Start hipwsgi.py running:
pbjam:~/projects/palmagent$ python hipwsgi.py &
Serving HTTP on 0.0.0.0 port 8080 ... - Use dangerSync.py to log in and get some session credentials for half an hou of use:
~/Desktop/danger-accts/ACCT $ python ~/projects/palmagent/dangerSync.py \
--prod --user ACCT \
--passwd YOUR_PASSWORD_HERE \
>session-id - Visit http://0.0.0.0:8080/pim/ACCT and hit the Pull button.
Now you have event, task, contact, and note directories containing a JSON file for each record and hipwsgi.py lets you navigate them in a few different ways.
The pull feature is incremental; it grabs just the records that have changed since you previously pulled:
Pull majo from danger hiptop service
back to sync options
event
anchor: 1204914757247
The new feature today is the ical export, linked from the event categories page:
event
back to sync options
You can copy the address of that ical export link and subscribe to it from iCal, and bingo, there it is, merged with the SXSW calendar and such.
@@screenshot pending
hAudio for microformats mixtapes, in progress
I was visiting a friend and I wanted to play Back When I Could Fly and the easiest way was to burn a CD and put it in their CD player and while I was at it I figured I might as well pick a few other songs... a sort of mixtape to say thanks for letting me crash there.
That sort of artifact is too precious to leave locked up in iTunes's proprietary format, even if it is XML; as I said in a July 2000 message:
There are very few data formats I trust... when I use
the computer to capture my knowledge, I pretty
much stick to plain text, [X]HTML, and email. I use JPG, PNG, and PDF if I must,
but not for capturing knowledge for exchange, revision, etc.
So I wrote itunekb.py, which reads the iTunes data, picks out one playlist, and writes it out in hAudio format using a genshi template. The result is ordinary HTML at one level:
- Poems, Prayers And Promises by John Denver
4:06 from A Song's Best Friend: The Very Best Of John Denver [Disc 1] (2004)- Did You Feel The Mountains Tremble by Delirious?
4:42 from WOW Worship: Orange (Disc 1) (2000)- The Reason by Hoobastank
3:52 from The Reason (2003)- Back When I Could Fly by Trout Fishing In America
3:29 from Family Music Party (1998)- ...
At another level, it's yummy Semantic Web data.
Oops! Well, it used to be; but hAudio seems to be changing:
- 00:16, 12 Feb 2008 ManuSporny (Replaced FN with TITLE per the microformats-new mailing list discussion)
Here's hoping I find time to catch up.
The political power of (simple) Web computing
The original appearance of this entry was in Danny Weitzner - Open Internet Policy
It’s pretty amazing what a little bit of structured computer power can do when deployed on the Web. Slate’s Delegate Calculator puts in the hands of Web-enabled citizens some simple computing power that helps us to understand how the delegate counts in the upcoming Democratic primaries may effect the final outcome for Obama and Clinton over the next hours, weeks and months. The knowledge about which states have how many delegates, how they might be apportioned, etc., is information that used to be a closely guarded secret of the political intelligencia and the press. How, it’s out there for all of us to see. It’s such a useful tool that many reporters from other publications are actually writing about it:
Jonathan Alter, Hillary’s Math Problem, Newsweek (4 March 2008)
Peter Baker, Clinton Down, but not Out, for the Count, Washington Post.
Jason Tuohey, Delegate Counter, Boston Globe
Carol Lockhead, Obama Wins Vermont, But Look at the Math, San Francisco Chronicle.
Granted, Slate has a relationship with some of those new outlets, but it’s still striking to see computing make the political news.
Important FCC hearing on Net Neutrality in Cambridge, MA
The original appearance of this entry was in Danny Weitzner - Open Internet Policy
I’d encourage anyone in or around the Boston, MA area to come to the Federal Communications Commission’s field hearing on Broadband Network Management Practices. I’ll be testifying along with a range of witnesses, Dave Clark and David Reed (colleagues from MIT), representatives from various commercial groups, and a number of advocacy organizations such as Free Press. I understand Congressman Ed Markey, a longtime champion of the Internet and the Web, will also be appearing.
Here are the logistical details:
Monday, Feb 25, 2008
11:00 a.m. to 4:00 p.m.
Harvard Law School, Ames Courtroom, Austin Hall
1515 Massachusetts Avenue, Cambridge, Mass.
Accountability Appliances: What Lawyers Expect to See - Part III (User Interface)
I've written in the last two blogs about how lawyers operate in a very structured enviroment. This will have a tremendous impact on what they'll consider acceptable in a user interface. They might accept something which seems a bit like an outline or a form, but years of experience tell me that they will rail at anything code-like.
For example, we see
:MList a rdf:List
and automatically read
"MList" is the name of a list written in rdf
Or,
air:pattern {
:MEMBER air:in :MEMBERLIST.
and know that we are asking our system to look for a pattern in the data in which a particular "member" is in a particular list of members. Perhaps because law is already learning to read, speak, and think in another language, most lawyers look at lines like those above and see no meaning.
Our current work-in-progress produces output that includes:
bjb reject bs non compliant with S9Policy 1
Because
phone record 2892 category HealthInformation
Justify
bs request instruction bs request content
type Request
bs request content intended beneficiary customer351
type Benefit Action Instruction
customer351 location MA
xphone record 2892 about customer351
Nearly every output item is a hotlink to something which provides definition, explanation, or derivation. Much of it is in "Tabulator", the cool tool that aggregates just the bits of data we want to know.
From a user-interface-for-lawyers perspective, this version of output is an improvement over our earlier ones because it removes a lot of things programmers do to solve computation challenges. It removes colons and semi-colons from places they're not commonly used in English (i.e., as the beginning of a term) and mostly uses words that are known in the general population. It also parses "humpbacks" - the programmers' traditional
concatenation of a string of words - back into separate words. And, it replaces hyphens and underlines - also used for concatenation - with blank spaces.
At last week's meeting, we talked about the possibility of generating output which simulates short English sentences. These might be stilted but would be most easily read by lawyers. Here's my first attempt at the top-level template:
Issue: Whether the transactions in [TransactionLogFilePopularName] {about [VariableName] [VariableValue]} comply with [MasterPolicyPopularName]?
Rule: To be compliant, [SubPolicyPopularName] of [MasterPolicyPopularName] requires [PatternVariableName] of an event to be [PatternValue1].
Fact: In transaction [TransactionNumber] [PatternVariableName] of the event was [PatternValue2].
Analysis: [PatternValue2] is not [PatternValue].
Conclusion: The transactions appear to be non-compliant with [SubPolicyName] of [MasterPolicyPopularName].
This seems to me approximately correct in the context of requests for the appliance to reason over millions of transactions with many sub-rules. A person seeking an answer from the system would create the Issue question. The Issue question is almost always going to ask whether some series of transactions violated a super-rule and often will have a scope limiter (e.g., in regards to a particular person or within a date scope or by one entity), denoted here by {}.
From the lawyer perspective, the interesting part of the result is the finding of non-compliance or possible non-compliance. So, the remainder of the output would be generated to describe only the failure(s) in a pattern-matching for one or more sub-rules. If there's more than one violation, the interface would display the Issue once and then the Rule to Conclusion steps for each non-compliant result.
I tried this out on a laywer I know. He insisted it was unintelligible when the []'s were left in but said it was manageable when he saw the same text without them.
For our Scenario 9, Transaction 15, an idealized top level display would say:
Issue: Whether the transactions in Xphone's Customer Service Log about Person Bob Same comply with MA Disability Discrimination Law?
Rule: To be compliant, Denial of Service Rule of MA Disability Discrimination Law requires reason of an event to be other than disability.
Fact: In transaction Xphone Record 2892 reason of the event was Infectious Disease.
Analysis: Infectious disease is not other than disability.
Conclusion: The transactions appear to be non-compliant with Denial of Service Rule of MA Disability Discrimination Law.
Each one of the bound values should have a hotlink to a Tabulator display that provides background or details.
Right now, we might be able to produce:
Issue: Whether the transactions in Xphone's Customer Service Log about Betty JB reject Bob Same comply with MA Disability Discrimination Law?
Rule: To be non-compliant, Denial of Service Rule of MA Disability Discrimination Law requires REASON of an event to be category Health Information.
Fact: In transaction Xphone Record 2892 REASON of the event was category Health Information.
Analysis: category Health Information is category Health Information.
Conclusion: The transactions appear to be non-compliant with Denial of Service Rule of MA Disability Discrimination Law.
This example highlights a few challenges.
1) It's possible that only failures of policies containing comparative matches (e.g., :v1 sameAs :v2; :v9 greaterThan :v3; :v12 withinDateRange :v4) are legally relevant. This needs more thought.
2) We'd need to name every sub-policy or have a default called UnnamedSubPolicy.
3) We'd need to be able to translate statute numbers to popular names and have a default instruction to include the statute number when no popular name exists.
4) We'd need some taxonomies (e.g., infectious disease is a sub-class of disability).
5) In a perfect world, we'd have some way to trigger a couple alternative displays. For example, it would be nice to be able to trigger one of two rule structures: either one that says a rule requires a match or one that says a rules requires a non-match. The reason for this is that if we always have to use the same structure, about half of the outputs will be very stilted and cause the lawyers to struggle to understand.
6) We need someway to deal with something the system can't reason. If the law requires the reason to be disability and the system doesn't know whether health information is the same as or different from disability, then it ought to be able to produce an analysis that says something along the lines of "The relationship between Health Information and disability is unknown" and produce a conclusion that says "Whether the transaction is compliant is unknown." If we're reasoning over millions of transactions there are likely to be quite a few of these and they ought to be presented after the non-compliant ones.
Accountability Appliances: What Lawyers Expect to See - Part II (Structure)
Building accountability appliances involves a challenging intersection between business, law, and technology. In my first blog about how to satisfy the legal portion of the triad, I explained that - conceptually - the lawyer would want to know whether particular digital transactions had complied with one or more rules. Lawyers, used to having things their own way, want more... they want to get the answer to that question in a particular structure.
All legal cases are decided using the same structure. As first year law students, we spend a year with highlighter in hand, trying to pick out the pieces of that structure from within the torrent of words of court decisions. Over time, we become proficient and -- like the child who stops moving his lips when he reads -- the activity becomes internalized and instinctive. From then on, we only notice that something's not right by its absence.
The structure is as follows:
- ISSUE - the legal question that is being answered. Most typically it begins with the word "whether" "Whether the Privacy Act was violated?" Though the bigger question is whether an entire law was violated, because laws tend to have so many subparts and variables, we often frame a much narrower issue based upon a subpart that we think was violated, such as "Whether the computer matching prohibition of the Privacy Act was violated?"
- RULE - provides the words and the source of the legal requirement. This can be the statement of a particular law, such as "The US Copyright law permits unauthorized use of copyrighted work based upon four conditions - the nature of use, the the nature of the work, the amount of the work used, and the likely impact on the value of the work. 17 USC § 107." Or, it can be a rule created by a court to explain how the law is implemented in practical situations: "In our jurisdiction, there is no infringement of a copyrighted work when the original is distributed widely for free because there is no diminution of market value. Field v. Google, Inc., 412 F. Supp 2d. 1106 (D.Nev. 2006)." [Note: The explanation of the citation formats for the sources has filled books and blogs. Here's a good brief explanation from Cornell.]
- FACTS - the known or asserted facts that are relevant to the rule we are considering and the source of the information. In a Privacy Act computer matching case, there will be assertions like "the defendant's CIO admitted in deposition that he matched the deadbeat dads list against the welfare list and if there were matches he was to divert the benefits to the custodial parent." In a copyright case fair use case, a statement of facts might include "plaintiff admitted that he has posted the material on his website and has no limitations on access or copying the work."
- ANALYSIS - is where the facts are pattern-matched to the rule. "The rule does not permit US persons to lose benefits based upon computer matched data unless certain conditions are met. Our facts show that many people lost their welfare benefits after the deadbeat data was matched to the welfare rolls without any of the other conditions being met." Or "There can be no finding of copyright infringement where the original work was so widely distributed for free that it had no market value. Our facts show that Twinky Co. posted its original material on the web on its own site and every other site where it could gain access without any attempt to control copying or access."
- CONCLUSION - whether a violation has or has not occurred. "The computer matching provision of the Privacy Act was violated." or "The copyright was not infringed.
In light of this structure, we've been working on parsing the tremendous volume of words into their bare essentials so that they can be stored and computed to determine whether certain uses of data occurred in compliance with law. Most of our examples have focused on privacy.
Today, the number of sub-rules, elements of rules, and facts are often so voluminous that there is not enough time for a lawyer or team of lawyers to work through them all. So, the lawyer guesses what's likely to be a problem and works from there; the more experienced or talented the lawyer, the more likely that the guess leads to a productive result. Conversely, this likely means that many violations are never discovered. One of the great benefits of our proposed accountability appliance is that it could quickly reason over a massive volume of sub-rules, elements, and facts to identify the transactiions that appear to violate a rule or for which there's insufficient information to make a determination.
Although we haven't discussed it, I think there also will be a benefit to be derived from all of the reasoning that concludes that activities were compliant. I'm going to try to think of some high value examples.
Two additional blogs are coming:
Physically, what does the lawyer expect to see? At the simplest level, lawyers are expecting to see things in terms they recognize and without unfamiliar distractions; even the presence of things like curly brackets or metatags will cause most to insist that the output is unreadable. Because there is so much information, visualization tools present opportunities for presentations that will be intuitively understood.
And:
The 1st Lawyer to Programmer/Programmer to Lawyer Dictionary! Compliance, auditing, privacy, and a host of other topics now have lawyers and system developers interacting regularly. As we've worked on DIG, I've noticed how the same words (e.g., rules, binding, fact) have different meanings.
Accountability Appliances: What Lawyers Expect to See - Part I
Just before the holidays, Tim suggested I blog about "what lawyers expect to see" in the context of our accountability appliances projects. Unfortunately, being half-lawyer, my first response is that maddening answer of all lawyers - "it depends." And, worse, my second answer is - "it depends upon what you mean by 'see'". Having had a couple of weeks to let this percolate, I think I can offer some useful answers.
Conceptually, what does the lawyer expect to see? The practice of law has a fundamental dichotomy. The law is a world of intense structure -- the minutae of sub-sub-sub-parts of legal code, the precise tracking of precedents through hundreds of years of court decisions, and so on. But, the lawyers valued most highly are not those who are most structured. Instead, it is those who are most creative at manipulating the structure -- conjuring compelling arguments for extending a concept or reading existing law with just enough of a different light to convince others that something unexpected supersedes something expected. In our discussions, we have concluded that an accountability appliance we build now should address the former and not the latter.
For example, a lawyer could ask our accountability appliance if a single sub-rule had been complied with: "Whether the federal Centers for Disease Control was allowed to pass John Doe's medical history from its Epidemic Investigations Case Records system to a private hospital under the Privacy Act Routine Use rules for that system?" Or, he could ask a question which requires reasoning over many rules. Asking "Whether the NSA's data mining of telephone records is compliant with the Privacy Act?" would require reasoning over the nearly thirty sub-rules contained within the Privacy Act and would be a significant technical accomplishment. Huge numbers of hours are spent to answer these sorts of questions and the automation of the more linear analysis would make it possible to audit vastly higher numbers of transactions and to do so in a consistent manner.
If the accountability appliance determined that a particular use was non-compliant, the lawyer could not ask the system to find a plausible exception somewhere in all of law. That would require reasoning, prioritizing, and de-conflicting over possibly millions of rules -- presenting challenges from transcribing all the rules into process-able structure and creating reasoning technology that can efficiently process such a volume. Perhaps the biggest challenge, though, is the ability to analogize. The great lawyer draws from everything he's ever seen or heard about to assimilate into the new situation to his client's benefit. I believe that some of the greatest potential of the semantic web is in the ability to make comparisons -- I've been thinking about a "what's it like?" engine -- but this sort of conceptual analogizing seems still a ways in the future.
Stay tuned for two additional blogs:
Structurally, what does the lawyer expect to see? The common law (used in the UK, most of its former colonies, including the US federal system, and most of US states) follows a standard structure for communicating. Whether a lawyer is writing a motion or a judge is writing a decision, there is a structure embedded within all of the verbiage. Each well-formed discussion includes five parts: issue, rule, fact, analysis, and conclusion.
Physically, what does the lawyer expect to see? At the simplest level, lawyers are expecting to see things in terms they recognize and without unfamiliar distractions; even the presence of things like curly brackets or metatags will cause most to insist that the output is unreadable. Because there is so much information, visualization tools present opportunities for presentations that will be intuitively understood.
And:
The 1st Lawyer to Programmer/Programmer to Lawyer Dictionary! Compliance, auditing, privacy, and a host of other topics now have lawyers and system developers interacting regularly. As we've worked on DIG, I've noticed how the same words (e.g., rules, binding, fact) have different meanings.
Reciprocal Privacy for the Social Web (a.k.a. FOAF)
The original appearance of this entry was in Danny Weitzner - Open Internet Policy
I’ve loved the idea of FOAF for a long time but always been bothered by the privacy risks that would result of FOAF really took off as a way to represent our social networks. Here’s an idea about how to address privacy in open social networks such as those represented by FOAF-like data structures.
It’s called (for now) REP: Reciprocal Privacy for Social Networks
ReP is a proposal to establish a reasonable privacy balance in social networking environment. Today, more and more social networks are coming onto the Web and are working to share more data across the previously-established boundaries that have previously separate these networks. Participants in social networks should have the benefit of widely shared agreements about how the information they present in those networks will be analyzed and used. To encourage the development of these social and legal privacy norms, we need a simple policy language for expressing rules associated with personal information, and a reliable, scalable mechanism for assessing accountability with those rules. We propose a new protocol by which those who share personal information on the Web can have increased confidence that this information will be used in a transparent manner and that users of the personal information will be able to be held accountable to comply with the stated usage rules.
Privacy policies and associated technologies must provide individuals harmed by breaches with legal recourse against those who abuse the norms of information usage. Hence, agreements must be clear and structured in a manner that there is a chance that the existing legal system could provide a remedy for harm. We should neither expect nor require than a single set of norms will be adequate for all users, all social networking contexts or all cultures, but there should be a common framework and a basic policy vocabulary that can express commonly used rules and be easily extended.
The key to sharing personal information across a diversity of privacy policy frameworks is to establish legal and technical mechanisms that ensures a baseline of social and legal accountability across varying rulesets. Participants in the ReP web must agree as a condition of accessing anyone else’s personal information that usage of personal information will be reported by the user to a log specified by the data subject. Further, anyone who uses the personal information must agree to require that the same set of rules (both the logging requirement and whatever usage rules came with the data) be applied to any subsequent users of the data. The log will allow the data subject to check that a specific usage of personal information complies with the specified usage limitations, and to follow the trail of accountability from the initial access of the data through to the final usage event.
This copy-left-inspired viral policy is the most effective way to assure that the original rules associated with personal data are respected as that data is re-used over and over again in a variety of contexts. In the event of misuse, the logs will provide a means to locate the mis-user and seek correction or other redress. In the event that a use of personal information is discovered which is NOT recorded in the person’s accountability log, that use is by definition a violation of the ReP policy. In many cases where such unauthorized use does real harm to the data subject, it will be possible with some amount of forensic effort will find the mis-user and enable redress. Of course, there will be anonymous mis-users of personal information. We cannot insulate Web users from those risks with ReP, but neither can any other privacy protection strategy that is feasible in an inherently open information environment.
There’s more to read in a skeletal REP design document.
The policy is still rough and the technology hasn’t been built yet, but I’d still really like reactions. ![]()
I can only imagine...
I have a new bookmark. No, not a del.icio.us bookmark; not some bits in a file. This is the kind you have to go there to get.. go to Cleveland, that is. It reads:
Thank you
for you love & support
for the Ikpia & Ogbuji families
At this time of real need.
We will never forget
Imose, Chica, & Anya
![]()
Abundant Life International Church
Highland heights, OH
After working with Chime for a year or so on the GRDDL Working Group (he was the difference between a hodge-podge of test files and a nicely organized GRDDL Test Cases technical report), I was really excited to meet him at the W3C Technical Plenary in Cambridge in early November. His Fuxi work is one of the best implementations of the way I think semantic web rules and proofs should go. When he told me some people didn't see the practical applications, it made me want to fly there and tell them how I think this will save lives and end world hunger.
So this past Tuesday, when I read the news about his family, the only way I could make my peace with it was to go and be with him. I can only imagine what he is going through. Eric Miller and Brian and David drove me to the funeral, but the line to say hi to the family was too long. And the internment service didn't really provide an opportunity to talk. So I was really glad that after I filled my plate at the reception, a seat across from Chime and Roschelle opened up for me and I got to sit and share a meal with them.
Grandpa Linus was at the table, too. His eulogy earlier at the funeral ended with the most inspiring spoken rendition of a song that I have ever heard:
Now The Hacienda's Dark The Town Is Sleeping
Now The Time Has Come To Part The Time For Weeping
Vaya Con Dios My Darling
Vaya Con Dios My Love
His eulogy is also posted as Grandparents' lament on The Kingdom Kids web site, along with details about a fund to help the family get back on their feet.

