blogs

The details of data in documents; GRDDL, profiles, and HTML5

Submitted by connolly on Fri, 2008-08-22 14:09. ::
GRDDL, a mechanism for putting RDF data in XML/XHTML documents, is specified mostly at the XPath data model level. Some GRDDL software goes beyond XML and supports HTML as she are spoke, aka tag soup. HTML 5 is intended to standardize the connection between tag soup and XPath. The tidy use case for GRDDL anticipates that using HTML 5 concrete syntax rather than XHTML 1.x concrete syntax involves no changes at the XPath level.

But in GRDDL and HTML5, Ian Hickson, editor of HTML 5, advocates dropping the profile attribute of the HTML head element in favor of rel="profile" or some such. I dropped by the #microformats channel to think out loud about this stuff, and Tantek said similarly, "we may solve this with rel="profile" anyway." The rel-profile topic in the microformats wiki shows the idea goes pretty far back.

Possibilities I see include:
  • GRDDL implementors add support for rel="profile" along with HTML 5 concrete syntax.
  • GRDDL implementors don't change their code, so people who want to use GRDDL with HTML 5 features such as <video> stick to XML-wf-happy HTML 5 syntax and they use the head/@profile attribute anyway, despite what the HTML 5 spec says.
  • People who want to use GRDDL stick to XHTML 1.x.
  • People who want to put data in their HTML documents use RDFa.
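To make the first possibility concrete, here is a minimal sketch of what "support both" might look like in a GRDDL-aware consumer. It is not taken from any existing GRDDL implementation. The function name find_profiles and the class name ProfileFinder are hypothetical, and since the rel="profile" design is not settled ("or some such"), the sketch simply assumes it would appear as a link (or a) element whose rel contains "profile" and whose href gives the profile URI. The GRDDL profile URI itself is the one from the GRDDL spec.

```python
from html.parser import HTMLParser

# Profile URI defined by the GRDDL specification.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


class ProfileFinder(HTMLParser):
    """Collect profile URIs from head/@profile and from rel="profile" links.

    Assumption: rel="profile" is expressed on <link> or <a> with an href.
    """

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head" and attrs.get("profile"):
            # head/@profile is a space-separated list of URIs.
            self.profiles.extend(attrs["profile"].split())
        elif tag in ("link", "a"):
            rels = (attrs.get("rel") or "").lower().split()
            if "profile" in rels and attrs.get("href"):
                self.profiles.append(attrs["href"])


def find_profiles(html_text):
    """Return every profile URI found by either mechanism, in document order."""
    finder = ProfileFinder()
    finder.feed(html_text)
    finder.close()
    return finder.profiles


old_style = '<html><head profile="http://www.w3.org/2003/g/data-view"><title>t</title></head></html>'
new_style = '<html><head><link rel="profile" href="http://www.w3.org/2003/g/data-view"><title>t</title></head></html>'

# Both spellings lead to the same decision: is this a GRDDL source document?
print(GRDDL_PROFILE in find_profiles(old_style))
print(GRDDL_PROFILE in find_profiles(new_style))
```

The point of the sketch is only that the two designs differ in where the URI lives, not in what a consumer does once it has found it.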

I don't particularly care for the rel="profile" design, but one should choose one's battles and I'm not inclined to choose this one. I'm content for the market to choose.

 

Microsoft on the need for openness in scholarly tools and data

Submitted by Danny Weitzner on Mon, 2008-07-28 11:43. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

I’m sitting at the Microsoft Faculty Summit, listening to Tony Hey (VP for External Research) talk about how critical it is for scientific researchers to have open access to data and open source tools (Tony actually said ‘free software tools’) in order to solve the most critical problems of the world. Among other things, Tony highlighted the importance of attaching metadata to documents and data, mentioning some of the MSFT tools such as a MS-Office plug-in that attaches Creative Commons labels to Office docs.

Anyone who thinks that institutional views are monolithic or easy to predict….

Conflicting voices in the liberal mainstream on FISA

Submitted by Danny Weitzner on Wed, 2008-07-09 15:30. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

Today, the Senate passed a much-debated revision to the Foreign Intelligence Surveillance Act (’Senate Passes Surveillance Bill With Immunity for Telecom Firms‘, Washington Post, William Branigin, 9 July 2008). There are lots of different views out there on the FISA legislation, even amongst the mainstream liberal establishment.

In advance of this vote, there has been much debate, most recently because Sen. Obama announced that he would support this compromise bill and not vote in support of a filibuster. (Full disclosure, I’m an Obama supporter and have helped the campaign on Internet policy issues.) In thinking about this, I thought I’d survey the range of opinion just on the liberal center. Here’s some of what I found:

Mort Halperin, highly regarded civil libertarian, former head of the ACLU Washington office, and himself a target of unwarranted government wiretapping when he was working for Henry Kissinger in the Nixon White House, writes in a New York Times Op-Ed (’Listening to Compromise‘, New York Times, 8 July 2008):

The compromise legislation that will come to the Senate floor this week is not the legislation that I would have liked to see, but I disagree with those who suggest that senators are giving in by backing this bill.

The fact is that the alternative to Congress passing this bill is Congress enacting far worse legislation that the Senate had already passed by a filibuster-proof margin, and which a majority of House members were on record as supporting.

What’s more, this bill provides important safeguards for civil liberties. It includes effective mechanisms for oversight of the new surveillance authorities by the FISA court, the House and Senate Intelligence Committees and now the Judiciary Committees. It mandates reports by inspectors general of the Justice Department, the Pentagon and intelligence agencies that will provide the committees with the information they need to conduct this oversight. (The reports by the inspectors general will also provide accountability for the potential unlawful misconduct that occurred during the Bush administration.) Finally, the bill for the first time requires FISA court warrants for surveillance of Americans overseas.

As someone whose civil liberties were violated by the government, I understand this legislation isn’t perfect. But I also believe — and here I am speaking only for myself — that it represents our best chance to protect both our national security and our civil liberties. For that reason, it has my personal support.

On the same day, the New York Times Editorial Board wrote against the bill (’Compromising the Constitution‘, 8 July 2008):

The Senate should reject a bill this week that would needlessly expand the government’s ability to spy on Americans and ensure that the country never learns the full extent of President Bush’s unlawful wiretapping.

[..]

Supporters will argue that the new bill still requires a warrant for eavesdropping that “targets” an American. That’s a smokescreen. There is no requirement that the government name any target. The purpose of warrantless eavesdropping could be as vague as listening to all calls to a particular area code in any other country.

The real reason this bill exists is because Mr. Bush decided after 9/11 that he was above the law. When The Times disclosed his warrantless eavesdropping, Mr. Bush demanded that Congress legalize it after the fact. The White House scared Congress into doing that last year, with a one-year bill that shredded FISA’s protections. Democratic lawmakers promised to fix it this year.

[..]

The bill dangerously weakens the 1978 Foreign Intelligence Surveillance Act, or FISA. Adopted after the abuses of the Watergate and Vietnam eras, the law requires the government to get a warrant to intercept communications between anyone in this country and anyone outside it — and show that it is investigating a foreign power, or the agent of a foreign power, that plans to harm America.
Proponents of the FISA deal say companies should not be “punished” for cooperating with the government. That’s Washington-speak for a cover-up. The purpose of withholding immunity is not to punish but to preserve the only chance of unearthing the details of Mr. Bush’s outlaw eavesdropping. Only a few senators, by the way, know just what those companies did.

And today, the Washington Post, often somewhat more centrist on civil liberties matters than the Times, editorialized (’FISA’s Fetters‘, 9 July 2008):

These are serious concerns, worth taking seriously. We are under no illusion that the measure is perfect; future fine-tuning may well be called for. The classified nature of the surveillance program makes it impossible to assess the implications with anything near certainty. But the legislation reflects, as far as we can tell, a reasonable compromise, worked out over long months of negotiations, between the legitimate needs of intelligence agencies and the legitimate privacy interests of Americans.

The measure requires an individualized, court-approved warrant to conduct surveillance targeted at Americans’ communications with those overseas and — in an expansion of existing FISA protections — at Americans abroad. Purely domestic-to-domestic communications, even among foreigners here, would require a warrant as well. Intelligence agencies would be able to target and collect the communications of non-Americans “reasonably believed to be located outside the United States,” even if their phone calls or e-mails passed through or were stored in the United States. But the agencies are required to adopt procedures to “prevent the intentional acquisition” of purely domestic communications and to minimize the retention and dissemination of such information.

more to come…

Google, Viacom, Privacy and Copyright meet the social web

Submitted by Danny Weitzner on Thu, 2008-07-03 21:43. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

In all the recent uproar (New York Times, “Google Told to Turn Over User Data of YouTube,” Miguel Helft, 4 July 2008) about the fact that Google has been forced to turn over a large pile of personally-identifiable information to Viacom as part of a copyright dispute (Opinion), there is a really interesting angle pointed out by Dan Brickley (co-creator of FOAF and general Semantic Web troublemaker). Dan points out in a blog entry today that while the parties before the court are arguing about whether the YouTube ID is, by itself, personally identifiable information, the fact is that the publicly visible part of this ID in the context of other information on the Web is sufficient to identify a lot about a person, not the least of which is their name. Dan explains:

YouTube users who have linked their YouTube account URLs from other social Web sites (something sites like FriendFeed and MyBlogLog actively encourage), are no longer anonymous on YouTube. This is their choice. It can give them a mechanism for sharing ‘favourited’ videos with a wide circle of friends, without those friends needing logins on YouTube or other Google services. This clearly has business value for YouTube and similar ’social video’ services, as well as for users and Social Web aggregators.

Given such a trend towards increased cross-site profile linkage, it is unfortunate to read that YouTube identifiers are being presented as essentially anonymous IDs: this is clearly not the case. If you know my YouTube ID ‘modanbri’ you can quite easily find out a lot more about me, and certainly enough to find out with strong probability my real world identity. As I say, this is my conscious choice as a YouTube user; had I wanted to be (more) anonymous, I would have behaved differently. To understand YouTube IDs as being anonymous accounts is to radically misunderstand the nature of the modern Web.

Dan makes a really important point here. On the one hand, the fact that we are all more identifiable as a result of the social networks in which we exist suggests that the judge was just plain wrong (even wronger than others have already said) in saying that the YouTube IDs are not personally-identifiable. But on the other hand, to the extent that Dan is correct about the revealing nature of the social web (true for some of us now, more and more in the future), we have to face the fact that merely limiting disclosure of personal information from one source is less and less likely to protect privacy effectively across the Web.

Applying this view to the Viacom v. YouTube case suggests that privacy protection has to focus more on limiting how people and institutions can *use* personal information, even as we recognize that it is harder and harder to protect privacy by access control alone.

Some of my colleagues and I have written about this view of privacy as Information Accountability in last month’s Communications of the ACM.

A Political Denial of Service (PDOS) attack on blogger.com?

Submitted by Danny Weitzner on Tue, 2008-07-01 19:31. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

A little transparency would go a long way toward helping keep online political discourse open, especially in the particular corner of the blogosphere run by Google (i.e., blogger.com). The Herald Tribune (Bloggers take aim at Google - International Herald Tribune) reports on a controversy involving pro-Clinton blogs that might have been blocked as spam due to what we might call a PDOS (Political Denial of Service) attack in a skirmish between Obama and Clinton partisans. The IHT asks:

Was Google’s network of online services manipulated to silence critics of Barack Obama? That was the question buzzing on a corner of the blogosphere over the past few days, after several anti-Obama bloggers were unable to update their sites, which are hosted on Google’s Blogger service.

It is alleged that some pro-Clinton blogs were blocked after a number of pro-Obama users marked them as ’spam’ on blogger.com. A Google spokesperson explained:

“It appears that our anti-spam filters caused some Blogger accounts to be blocked from creating new posts,” a Google spokesman, Adam Kovacevich, said in a statement. “While we are still investigating, we believe this may have been caused by mass spam e-mails mentioning the ‘Just Say No Deal’ network of blogs, which in turn caused our system to classify the blog addresses mentioned in the e-mails as spam.”

Kovacevich said that Google had restored posting rights to the affected blogs and that it was “very important” to Google “that Blogger remain a tool for political debate and free expression.” He gave no further details about Google’s spam-monitoring techniques or how they relate to the Blogger service.

It certainly would be useful if Google could provide some transparency into what they block and why. That way, either Google or the possibly malicious spam-flaggers could be held accountable for their behavior. (In a recent CACM piece on Information Accountability we explain why accountability is so important on the Web and how we might have more of it through additions to the architecture of the Web.)

Google does a very good job of giving transparent explanations when their search results contain information that has been blocked for legal reasons such as copyright takedown notices. I hope they can find a way to bring similar transparency to their part of the blogosphere.

Important New Jersey Supreme Court decision in Internet privacy

Submitted by Danny Weitzner on Wed, 2008-04-23 09:35. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

The New Jersey Supreme Court (State of New Jersey v. Shirley Reid (A-105-06)) has issued an important decision on Internet users’ right to privacy. The case involves a dispute about whether an ISP violated a user’s privacy rights by turning over subscriber information (name, address, billing details) associated with a particular IP address. It turns out that the subpoena served on the ISP was invalid for a variety of reasons. As the user had a ‘reasonable expectation of privacy’ in her Internet activities and identifying information, and because the subpoena served on the ISP was invalid, the New Jersey court determined that the ISP should not have turned over the personal data.

The important aspect of this case in the evolving understanding of privacy on the Internet is the court’s recognition that we must look at privacy from the broad perspective of what can actually be discovered about people online. In this way, the ruling has significant strengths and weaknesses from a privacy perspective. On the one hand, the court finds that there is, today, an expectation of privacy in IP addresses because they are currently hard to link to personal identity. There have been lots of disputes in the US and the EU about whether IP addresses are ‘personally identifying information’ (“PII” in the jargon of privacy). This court takes a pragmatic view of this question and finds that IP addresses should be considered private for now, but that this may change. The court finds:

the reasonableness of the privacy interest may change as technology evolves. A reasonable expectation of privacy is required to establish a protected privacy interest…. Internet users today enjoy relatively complete IP address anonymity when surfing the Web. Given the current state of technology, the dynamic, temporarily assigned, numerical IP address cannot be matched to an individual user without the help of an ISP. Therefore, we accept as reasonable the expectation that one’s identity will not be discovered through a string of numbers left behind on a website.

The availability of IP Address Locator Websites has not altered that expectation because they reveal the name and address of service providers but not individual users. Should that reality change over time, the reasonableness of the expectation of privacy in Internet subscriber information might change as well. For example, if one day new software allowed individuals to type IP addresses into a “reverse directory” and identify the name of a user — as is possible with reverse telephone directories — today’s ruling might need to be reexamined.

Others have written about the legal details of this case and have suggested that it is a big win for privacy. Given the reliance on the shifting state of identity technology, I’m a little less sanguine.

This case is yet another reason why I believe (as I’ve explained elsewhere) that meaningful privacy on the Web requires rules that govern how personal information is used, not just what can be collected. Under the court’s reasoning, as our lives become more and more transparent, that would justify increasingly harmful use of personal data. While it’s pretty hard to control how exposed we all become, we still can limit how powerful institutions (governments, etc.) use personal data about us.

Bob Metcalfe's wisdom on patents and innovation

Submitted by Danny Weitzner on Thu, 2008-04-10 21:46. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

Ethernet inventor, journalist and now venture capitalist Bob Metcalfe speaks on the lessons from the Internet community for the global warming arena. In looking at how to accelerate technical innovation to address climate change, Metcalfe asserts that:

“… the place to do research is in university labs. The best vehicle for technology innovation is not patents, it’s students.”

Of course, Bob also manages to express his disdain for monopoly, Bell Labs, and even Al Gore. (See report by Martin LaMonica.) I’m not sure about those but think he’s right on with respect to patents.

On meetings

Submitted by Danny Weitzner on Sun, 2008-03-30 11:16. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

Ever the astute observer of the various features and bugs of our collective behavior, a longtime mentor of mine, Mitch Kapor, has coined a new definition:

Meetingboarding: (n) the sensation of being unable to breathe arising from continuous immersion in meeting after meeting

I’d add to this a characterization of email that I learned from Mitch many years ago:

The problem with email is that it has low emotional bandwidth.
-Mitch Kapor, circa 1991

Semantic Web in the news

Submitted by timbl on Thu, 2008-03-27 16:43. ::

Well, the Semantic Web has been in the news a bit recently.

There was the buzz about Twine, a "Semantic Web company", getting another round of funding. Then, Yahoo announced that it will pick up Semantic Web information from the Web, and use it to enhance search. And now the Times online mis-states that I think "Google could be superseded". Sigh. That was a misunderstanding, in an otherwise useful discussion largely about what the Semantic Web is and how it will affect people, which ended up being the title of the blog. In fact, the conversation as I recall started with the question of what, if search engines were the killer app for the familiar Web of documents, the killer app for the Semantic Web will be.

Text search engines are of course good for searching the text in documents, but the Semantic Web isn't text documents, it is data. It isn't obvious what the killer apps will be - there are many contenders. We know that the sort of query you do on data is different: the SPARQL standard defines a query language and protocol which allow application builders to query remote data stores. So that is one sort of query on data which is different from text search.
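To give a feel for how a query on data differs from a text search, here is a minimal sketch of a SPARQL protocol request, assuming the query is sent as the query parameter of an HTTP GET. The endpoint http://example.org/sparql and the helper name sparql_get_url are made up for illustration, and the FOAF query is just one example of asking for structured facts rather than matching words. The sketch only builds the request URL and does not contact any server.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query):
    """Build the GET URL for a SPARQL protocol request.

    The query text travels, URL-encoded, in the 'query' parameter.
    """
    return endpoint + "?" + urlencode({"query": query})


# Ask for names of people described with FOAF, rather than searching for words.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 10
"""

# Hypothetical endpoint; substitute a real SPARQL server to try it out.
url = sparql_get_url("http://example.org/sparql", QUERY)
print(url)
```

An application would then fetch that URL and read back a table of variable bindings, which is a rather different thing from a ranked list of documents.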

One thing to always remember is that the Web of the future will have BOTH documents and data. The Semantic Web will not supersede the current Web. They will coexist. The techniques for searching and surfing the different aspects will be different but will connect. Text search engines don't have to go out of fashion.

The "Google will be superseded" headline is an unfortunate misunderstanding. I didn't say it. (We have, by the way, asked for it to be fixed. One can, after all, update a blog to fix errors, and this should be appropriate. Ian Jacobs wrote an email, left voice mail, and tried to post a reply to the blog, but the reply did not appear on the blog - moderated out? So we tried.)

Now of course, as the name of The Times was once associated with a creditable and independent newspaper :-), the headline was picked up and elaborated on by various well-meaning bloggers. So the blogosphere, which one might hope to be the great safety net under the conventional press, in this case just amplified the error.

I note that here the blogosphere was misled by an online version of a conventional organ. There are many who worry about the inverse, that decent material from established sources will be drowned beneath a tide of low-quality information from less creditable sources.

The Media Standards Trust is a group which has been working with the Web Science Research Initiative (I'm a director of WSRI) to develop ways of encoding the standards of reporting a piece of information purports to meet: "This is an eye-witness report"; or "This photo has not been massaged apart from: cropping"; or "The author of the report has no commercial connection with any products described"; and so on. Like Creative Commons, which lets you mark your work with a licence, the project involves representing social dimensions of information. And it is another Semantic Web application.

In all this Semantic Web news, though, the proof of the pudding is in the eating. The benefit of the Semantic Web is that data may be re-used in ways unexpected by the original publisher. That is the value added. So when a Semantic Web start-up either feeds data to others who reuse it in interesting ways, or itself uses data produced by others, then we start to see the value of each bit increased through the network effect.

So if you are a VC funder or a journalist and some project is being sold to you as a Semantic Web project, ask how it gets extra re-use of data, by people who would not normally have access to it, or in ways for which it was not originally designed. Does it use standards? Is it available in RDF? Is there a SPARQL server?

A great example of Semantic Web data which works this way is Linked Data. There is a growing mass of interlinked public data, much of it promoted by the Linked Open Data project. There is an upcoming Linked Data workshop on this at the WWW 2008 Conference in April in Beijing, and on June 17-18 in New York at the Linked Data Planet Conference. Linked data comes alive when you explore it with a generic data browser like the Tabulator. It also comes alive when you make mashups out of it. (See Playing with Linked Data, Jamendo, Geonames, Slashfacet and Songbird; Using Wikipedia as a database). It should be easier to make those mashups by just pulling RDF (maybe using RDFa or GRDDL) or using SPARQL, rather than having to learn a new set of APIs for each site and each application area.

I think there is an important "double bus" architecture here, in which there are separate markets for the raw data and for the mashed up data. Data publishers (e.g., government departments) just produce raw data now, and consumer-facing sites (e.g., soccer sites) mash up data from many sources. I might talk about this a bit at WWW 2008.

So in scanning new Semantic Web news, I'll be looking out for re-use of data. The momentum around Linked Open Data is great and exciting -- let us also make sure we make good use of the data.

Today - NPR Science Friday program on Web privacy issues

Submitted by Danny Weitzner on Fri, 2008-03-21 12:24. ::

The original appearance of this entry was in Danny Weitzner - Open Internet Policy

National Public Radio’s Science Friday program will feature a discussion of online privacy with Alessandro Acquisti of CMU and yours truly a little later today. It’s live from 3:00 - 4:00 pm Eastern/US, rebroadcast at various times depending on where you live, and streamed on the Web.

Listen in. Call and challenge other listeners to think about the privacy questions raised by the Semantic Web!

Update: the broadcast is streamed at this link.
