Unless a piece of content on the Web is protected by strict access control policies, most users do not feel the need to check which license it is under or to comply with that license. To test this hypothesis, we conducted an experiment to estimate the level of CC attribution license violations on the Web, using Flickr images.

Experiment Setup

The sampling method used for the experiment was simple random sampling on clusters of Web pages gathered during a particular time frame. To ensure a fair sample, we drew the Web pages from the Technorati blog indexer rather than hand-picking the pages to be checked for attribution license violations. We limited each sample to around 70 Web pages and fewer than 500 images, so that we could manually inspect the results for false positives, false negatives, and any other errors. This also enabled us to check whether the different samples contained the same Web pages. We found that the correlation among the samples was minimal.

Sample Collection using the Technorati API

The Technorati blog indexer crawls and indexes blog-style Web pages and keeps track of what pages link to them, what pages they link to, how popular they are, how popular the pages that link to them are, and so on. Technorati data are time dependent, and therefore the Technorati Authority Rank, a measurement that determines the top "n" results from any query to the Technorati API, is based on the most recent activity of a particular Web page. We expected that the use of the Technorati Authority Rank would introduce a bias in our sample, because the top Web pages from the Technorati blog indexer are probably well visited, which puts more pressure on their owners to fix errors in attribution. However, the results proved otherwise.

The Technorati Cosmos querying functions retrieve the blogs that link to a given base URI, ordered by authority rank. To generate the samples, we therefore used the Cosmos functions to retrieve Web pages linking to Flickr server farm URIs that have this particular format:

(According to http://www.flickr.com/services/api/misc.urls.html, all Flickr images have that particular URI pattern).
Since the Flickr site has several server farms, the base URIs were randomly generated each time the experiment was run by altering the Flickr server 'farm-id's. In addition, to keep the samples independent of each other and the correlation among them low, we ran the experiment three times with two weeks between each trial. This works because the Authority Rank given to a Web page by Technorati, and hence the results returned from the Cosmos query functions, change dynamically as new content is created.
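The random generation of base URIs described above can be sketched as follows. This is only an illustration: the URI template, the range of farm ids, and the function name random_base_uris are assumptions made for the example, not the values or code used in the experiment.

```python
import random

# Assumption: a placeholder template for a Flickr farm base URI. The exact
# pattern used in the experiment is not reproduced here.
BASE_URI_TEMPLATE = "http://farm{farm_id}.static.flickr.com/"


def random_base_uris(farm_ids, count, seed=None):
    """Pick `count` distinct farm ids at random and build one base URI per id."""
    rng = random.Random(seed)
    chosen = rng.sample(list(farm_ids), count)
    return [BASE_URI_TEMPLATE.format(farm_id=farm_id) for farm_id in chosen]


# Assumption: farm ids 1 to 5 are used purely for illustration.
uris = random_base_uris(range(1, 6), count=3, seed=42)
for uri in uris:
    print(uri)
```

Each generated base URI would then be passed to the Cosmos query to retrieve the pages linking to it; passing a different seed (or none) on each trial gives a different set of base URIs.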

Criteria for Checking Attribution

Flickr is still using the older CC 2.0 recommendation. Therefore, Flickr users do not have much flexibility in specifying their own attributionURL or attributionName as defined in ccREL. However, it is considered good practice to give attribution by linking to the Flickr user profile or by giving the Flickr user name (which could be interpreted as the attributionURL and the attributionName respectively), or at the least by pointing to the original source of the image. The criteria for checking attribution therefore consist of looking for the attributionURL, the attributionName, or any source citation within a reasonable scope around the place where the image is embedded in the Document Object Model (DOM) of the corresponding Web page.
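A minimal sketch of such a scoped DOM check is shown below. It is not the code used in the experiment. It assumes that Flickr images can be recognized by "static.flickr.com" in the src attribute, that a link whose href contains "flickr.com/photos/" or "flickr.com/people/" counts as attribution, and that the scope is the subtree rooted a fixed number of levels above the image. It does not check for the user name in plain text, and the sample page and its URLs are made up.

```python
from html.parser import HTMLParser

# Elements that never have children, so the builder must not descend into them.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []

    def walk(self):
        """Yield this node and all of its descendants."""
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a small, forgiving element tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_startendtag(self, tag, attrs):
        self.current.children.append(Node(tag, attrs, self.current))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def is_flickr_image(node):
    return node.tag == "img" and "static.flickr.com" in (node.attrs.get("src") or "")


def is_attribution_link(node):
    href = node.attrs.get("href") or ""
    return node.tag == "a" and ("flickr.com/photos/" in href
                                or "flickr.com/people/" in href)


def check_attribution(html, max_levels=2):
    """Map each Flickr image src to True if an attribution link is in scope."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    results = {}
    for node in builder.root.walk():
        if not is_flickr_image(node):
            continue
        scope = node
        for _ in range(max_levels):  # climb at most max_levels ancestors
            if scope.parent is None or scope.parent is builder.root:
                break
            scope = scope.parent
        results[node.attrs["src"]] = any(is_attribution_link(n) for n in scope.walk())
    return results


page = """
<div class="post">
  <p><img src="http://farm1.static.flickr.com/2/1_a.jpg">
     Photo by <a href="http://www.flickr.com/people/alice/">alice</a></p>
</div>
<div class="post">
  <p><img src="http://farm2.static.flickr.com/3/4_b.jpg"></p>
  <p>No credit given here.</p>
</div>
"""
print(check_attribution(page))
```

In this made-up page the first image is credited inside its own post and the second is not. The max_levels parameter controls how far the scope extends: a larger value accepts credits placed further from the image, at the risk of matching a link that belongs to a different image.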

Results from the Experiment

Here are the results from the experiment conducted to measure how many CC-BY (attribution) license violations involving Flickr images there are on the Web. The code used to run the experiment can be found here.

Sample 1

Sample 2

Sample 3

Here is a summarized view of the results:

Experiment Results Summarized

These results show misattribution and non-attribution rates ranging from 78% to 94%, signaling a strong need to promote license and policy awareness among reusers of content. The full result set includes the total number of Web pages tested, the number of images in all of those Web pages, the number of properly attributed images, the number of misattributed or non-attributed images, and the number of instances that could not be checked because of parsing errors caused by bad HTML. Using these values, the percentages of misattribution and non-attribution for each sample were calculated.
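One plausible way to compute such a percentage is sketched below. It assumes that images which failed to parse are excluded from the denominator; whether the experiment counted them this way is not stated here. The counts in the example are made up for illustration and are not the experiment's data.

```python
def violation_rate(attributed, not_attributed):
    """Percentage of successfully checked images that lack proper attribution."""
    checked = attributed + not_attributed
    if checked == 0:
        raise ValueError("no images were checked")
    return 100.0 * not_attributed / checked


# Made-up counts for illustration only.
rate = violation_rate(attributed=20, not_attributed=80)
print(f"{rate:.1f}%")  # prints "80.0%"
```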

Issues and Limitations of the Experiment