Logging/Authenticating SPARQL client and server
Mike Stunes

$Date: 2008-08-27 13:37:40 -0400 (Wed, 27 Aug 2008) $
$Revision: 24930 $
$Author: stunes $

$Id: README 24930 2008-08-27 17:37:40Z stunes $

Contents
--------
0. Quick Start
1. Introduction
2. Dependencies
3. Getting Started
4. Files
5. Internal variables
6. Access Policy
7. Privacy Policy
8. Logging
9. Creating server instances
10. Client
11. TODOs


0. Quick Start
--------------

To run the SPARQL server:

a. Make sure the following are installed and on your PYTHONPATH:

rdflib <http://rdflib.net>
Tested with version 2.4.0.

Python OpenID library <http://openidenabled.com/python-openid>
Tested with version 2.1.1 of this library--version 1.x known not to work.
Should work with any 2.x version.

ElementTree <http://effbot.org/zone/element-index.htm>
(By default, this will try to use the built-in ElementTree from Python 2.5
and above, falling back on the external ElementTree if necessary.)

PyOpenSSL <http://pyopenssl.sourceforge.net/>
Tested with version 0.7.

Cwm <http://www.w3.org/2000/10/swap/doc/cwm.html>
Place a symlink named "swap" in the same directory as sparql_server_ssl.py,
pointing at the "swap" directory of your cwm distribution, or otherwise
make sure that the "swap" directory is accessible to Python.

b. Generate a self-signed certificate and key using OpenSSL:

> openssl req -new -x509 -keyout key.pem -out cert.pem

Move both files into the certs directory.

c. Modify the configuration file, sparql_server_config:
- set rules_file to your authorization file, or use access_policy/access_policy.n3 to restrict access to DIG members
- set policy_file to your usage policy, or use access_policy/air_access_policy.n3 as a default
- set keyfile to the key you just generated
- set certfile to the certificate you just generated
- set the data_* options to the MySQL store that contains the data you want to protect; data_identifier should be set to rdfstore (an rdflib requirement)
- set the log_* options to the MySQL store where the query log should be kept; its identifier should also be set to rdfstore

d. Start the mysql server(s)

e. Start the sparql server

> python sparql_server_ssl.py

f. When prompted, enter the pass phrase of the key you generated.

g. The script will attempt to import the libraries and then start the server. It will report the port on which the server is running, for example:
https://server:8080/

h. To test the server, open Firefox and go to:
https://server:8080/webform

i. You will be asked to log in with your OpenID.

j. After logging in successfully, you will see a query interface. Type a query that matches the data in your store, or use the default SELECT * WHERE {?s ?p ?o.}

k. If everything is installed properly, you will receive the SPARQL results tagged with the usage policy.


1. Introduction
---------------

This set of programs is an implementation of the SPARQL protocol, with some
extensions that are useful for working with sensitive data. The server behaves
almost like a standard SPARQL endpoint, except that it requires OpenID
authentication from the client. The endpoint also supports SSL
encryption with a server-side certificate, to keep data exchanges confidential.
Also, the server logs all incoming queries in an RDF format. See "Logging"
below for more details.

The server accepts a function that determines whether a given OpenID is allowed
to use the server. See "Access Policy" below for more details.

The server uses rdflib stores to hold the server's data and log. These are not
created automatically.

Navigating to the base URI of the server presents the user with a login form
where he/she can enter his/her OpenID. The server also provides the following
pages under the root directory:

For interactive use through a browser:
/webform -- interactive webform where a user can enter a query
/login -- handles requests for session credentials (user is sent here from
    the login page)
/loginComplete -- finishes the authentication process (user is sent here from
    his/her OP)

For noninteractive use:
/getSession -- page the client library requests when starting authentication
/clientLogin -- page that /getSession redirects to; opened in a browser by the
    client library; redirects to OP
/clientComplete -- page that OP redirects to; finishes the authentication
    process

For both uses:
/sparql -- page where actual queries are sent
/policy -- page that returns an applicable privacy/usage policy

Navigating to any other path will return a 404 Not Found.
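
As a rough illustration of how a request to /sparql might be addressed, the
sketch below builds a query URL. It assumes that the server accepts the
standard "query" parameter of the SPARQL protocol on a GET request; the
function name build_query_url is hypothetical and not part of this
distribution, and the session credentials obtained through the login pages are
not shown.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_query_url(base_uri, query):
    """Return a GET URL for the /sparql page (hypothetical helper).

    Assumes the standard SPARQL protocol "query" parameter.
    """
    return base_uri.rstrip('/') + '/sparql?' + urlencode({'query': query})


url = build_query_url('https://server:8080/', 'SELECT * WHERE {?s ?p ?o.}')
print(url)
```

The query string is percent-encoded so that characters such as "?" and "{" do
not get interpreted as part of the URL itself.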

The client library provides a class that encapsulates the interface to an
instance of the server. See "Client" below for more details.


2. Dependencies
---------------

The server requires that you have the following libraries installed:

rdflib <http://rdflib.net>
Tested with version 2.4.0.

Python OpenID library <http://openidenabled.com/python-openid>
Tested with version 2.1.1 of this library--version 1.x known not to work.
Should work with any 2.x version.

ElementTree <http://effbot.org/zone/element-index.htm>
(By default, this will try to use the built-in ElementTree from Python 2.5
and above, falling back on the external ElementTree if necessary.)

PyOpenSSL <http://pyopenssl.sourceforge.net/>
Tested with version 0.7.

Also, the included example code requires a cwm distribution on your machine.
Place a symlink named "swap" in the same directory as sparql_server_ssl.py,
pointing to the "swap" directory of your cwm installation.


3. Getting Started
------------------

First, read "Dependencies" above and make sure all of the needed libraries are
installed.

The server needs two rdflib graphs available, one for logging and one for data,
which by default are set up to be persistent MySQL stores. The database
parameters are specified in sparql_server_config.

New rdflib stores can be created easily. The example below creates one in a
database "test" on a MySQL server running on sql.example.org. Substitute "user"
and "pw" with a valid username and password, respectively. "identifier" is a
string that is hashed to create the table names in the MySQL database, allowing
one database to hold more than one RDF store. This identifier is specified,
along with the database parameters, in the config file.

>>> import rdflib
>>> from rdflib import plugin
>>> from rdflib.store import Store
>>> identifier = "rdfstore"  # must match the identifier in the config file
>>> configString = "host=sql.example.org,user=user,password=pw,db=test"
>>> store = plugin.get('MySQL', Store)(identifier)
>>> store.open(configString, create=True)

Data from an RDF file or URI pointing to RDF data can be imported like so:

>>> graph = rdflib.ConjunctiveGraph(store)
>>> graph.parse('http://example.org/some_data.rdf')
>>> graph.commit()

Changes to a persistent rdflib store do not become permanent until commit() is
called.

Another good idea is to create or otherwise acquire a valid SSL certificate for
the server, and edit the config file (sparql_server_config by default) to point
to the certificate and key files. A self-signed certificate can easily be made
using OpenSSL:

> openssl req -new -x509 -keyout key.pem -out cert.pem

The server includes default files to use as an access control policy and a
policy attached to the results, but different files can be specified using the
options rules_file and policy_file in the config.

Also, the server needs a directory to store persistent OpenID information
(nonces, etc.) which by default is a subdirectory named "openids" in the
directory containing the executable.

After all of the external components described above are in place, the server
may be started by executing "sparql_server_ssl.py".


4. Files
--------

sparql_server_new.py -- The server executable.
sparql_server_ssl.py -- The server, with SSL support.
sparql_client.py -- The client library.


5. Internal variables
---------------------

debug: if True, print debugging output to the console.
verbose: if True, print additional information above and beyond debugging
    to the console.


6. Access Policy
----------------

Access control is implemented by a function that is passed to the server when
it is initialized. The function must have this signature:

some_fn(openid) -> bool

The function takes an OpenID identifier, as a string, and returns a boolean
True if the user is allowed to use the server, False otherwise.
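
A minimal sketch of such a function is shown below. It checks the OpenID
against a fixed whitelist, which is a simplification chosen for illustration
and not the rules-file policy shipped with the server. The names
ALLOWED_OPENIDS, normalize and is_authorized are hypothetical.

```python
# Hypothetical whitelist; a real deployment would load this from a policy.
ALLOWED_OPENIDS = frozenset([
    'http://somebody.youropenid.com',
])


def normalize(openid):
    """Strip surrounding whitespace and any trailing slash."""
    return openid.strip().rstrip('/')


def is_authorized(openid):
    """Return True if the OpenID identifier may use the server."""
    return normalize(openid) in ALLOWED_OPENIDS


print(is_authorized('http://somebody.youropenid.com/'))
print(is_authorized('http://intruder.example.org'))
```

A function of this shape could then be passed to the server as the access
control function described above.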


7. Privacy Policy
-----------------

The server allows for a function that returns a privacy/usage policy to the
client when a request is made for "https://<server>/policy". A link to this
policy is also incorporated into the results returned from the server, via the
SPARQL "link" tag.

This function should take an instance of SparqlRequestHandler (or any subclass
of BaseHTTPRequestHandler) and send the applicable policy to the client through
it. Passing the handler lets the policy be returned as any MIME type, with any
headers that are necessary.
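
A policy function might look like the sketch below, which sends a plain-text
policy using the standard BaseHTTPRequestHandler response methods. The policy
text and the names POLICY_TEXT and return_policy are hypothetical, and the
policy files shipped with the server are N3 documents, not plain text.

```python
# Hypothetical policy text, sent as bytes.
POLICY_TEXT = b'Results may be used for research purposes only.\n'


def return_policy(handler):
    """Write the policy to the client through the given request handler."""
    handler.send_response(200)
    handler.send_header('Content-Type', 'text/plain; charset=utf-8')
    handler.send_header('Content-Length', str(len(POLICY_TEXT)))
    handler.end_headers()
    handler.wfile.write(POLICY_TEXT)
```

Because the function controls the headers itself, the same approach works for
returning N3, RDF/XML or HTML by changing the Content-Type and the body.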


8. Logging
----------

The server logs all requests to an rdflib store, which is passed to the server
when it is initialized. For each request, the server adds the following triples
to the log store:

@prefix data: <http://web.mit.edu/stunes/www/urop/log_ont.n3#>

:<identifier> rdf:type data:LogItem;
    data:hasQuery <query_string>^^owl:normalizedString;
    data:hasRequester <client-hostname>^^owl:normalizedString;
    data:hasTimestamp <timestamp>^^owl:dateTime;
    data:hasOpenId <openid>^^owl:anyURI;
    data:authenticated <authenticated?>^^(data:authenticatedTrue | data:authenticatedFalse)
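
To make the shape of a log entry concrete, the sketch below lists the six
triples for one request as plain Python tuples. It does not use rdflib and is
not the server's own logging code; the function name log_item_triples and the
sample values are hypothetical, and datatypes are omitted for brevity.

```python
LOG_NS = 'http://web.mit.edu/stunes/www/urop/log_ont.n3#'
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'


def log_item_triples(item, query, requester, timestamp, openid, authenticated):
    """Return the six (subject, predicate, object) triples for one request."""
    if authenticated:
        auth = LOG_NS + 'authenticatedTrue'
    else:
        auth = LOG_NS + 'authenticatedFalse'
    return [
        (item, RDF_TYPE, LOG_NS + 'LogItem'),
        (item, LOG_NS + 'hasQuery', query),
        (item, LOG_NS + 'hasRequester', requester),
        (item, LOG_NS + 'hasTimestamp', timestamp),
        (item, LOG_NS + 'hasOpenId', openid),
        (item, LOG_NS + 'authenticated', auth),
    ]


triples = log_item_triples(
    'item-1',  # hypothetical log item identifier
    'SELECT * WHERE {?s ?p ?o.}',
    'client.example.org',
    '2008-08-27T13:37:40',
    'http://somebody.youropenid.com',
    True,
)
for s, p, o in triples:
    print('%s %s %s' % (s, p, o))
```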


9. Creating server instances
----------------------------

A server instance is created like so:

foo = SparqlServer(rdfstore, logstore, openid_store, authorizedFunc, returnPolicyFunc, KEYFILE, CERTFILE, serveraddress, RequestHandlerClass)

with these parameters:

rdfstore: instance of rdflib.Graph that stores the server's data.
logstore: instance of rdflib.Graph that the server logs to.
openid_store: instance of openid.store.filestore.FileOpenIDStore that holds persistent OpenID data
authorizedFunc: function described in "Access Policy" above
returnPolicyFunc: function that returns a privacy/usage policy to the user. See "Privacy Policy" above
KEYFILE: filename (string) of the private key file to use for SSL
CERTFILE: filename (string) of the certificate file to use for SSL
serveraddress: tuple of (hostname, port) that the server listens on
RequestHandlerClass: this should be SparqlRequestHandler.

A function start_server has been provided for convenience.


10. Client
----------

The client library provides a class that encapsulates a connection to a SPARQL
server. Suppose that we have a server running at https://tasty-snack.mit.edu:3456.
We can create an object representing it as follows:

import sparql_client
from sparql_client import SparqlWrapper

foo = SparqlWrapper('https://tasty-snack.mit.edu:3456')

After creating the wrapper, the user must authenticate to the server with
his/her OpenID. This is done like so:

foo.authenticate('http://somebody.youropenid.com')

This will open a browser window pointed to a page on the server that will
redirect to the user's OP. After the authentication is done, the user will
get redirected to a listener that the client has started on localhost:9876,
which will present the user with an informational message stating that
the authentication has succeeded, and that queries may now be run.

Queries are run like so:

foo.setQueryString('SELECT * WHERE {?s ?p ?o}')
foo.setReturnFormat(<return-format>)  # optional; currently not implemented
foo.addExtraURITag(key, value)        # optional
results = foo.query()

query() returns an instance of SparqlResults, which functions as an iterator
that returns the XML returned by the server. It also supports the following
methods:

getURL(): returns the URL used to get the query results
getInfo(): returns additional metadata from the query results

Extra support for parsing query results into more usable data structures may
be implemented in the future.
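
Until then, the XML can be parsed by the caller. The sketch below assumes that
the server returns the standard SPARQL Query Results XML Format and that the
pieces yielded by the results iterator have already been joined into one
string. The function name parse_results and the sample document are
hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the SPARQL Query Results XML Format.
SPARQL_NS = '{http://www.w3.org/2005/sparql-results#}'


def parse_results(xml_text):
    """Return (links, rows): link hrefs and one dict per result row."""
    root = ET.fromstring(xml_text)
    links = [link.get('href') for link in root.iter(SPARQL_NS + 'link')]
    rows = []
    for result in root.iter(SPARQL_NS + 'result'):
        row = {}
        for binding in result.findall(SPARQL_NS + 'binding'):
            value = binding[0]  # the uri, literal or bnode child element
            row[binding.get('name')] = value.text
        rows.append(row)
    return links, rows


# Hypothetical sample response with one variable and one policy link.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="s"/>
    <link href="https://server:8080/policy"/>
  </head>
  <results>
    <result>
      <binding name="s"><uri>http://example.org/a</uri></binding>
    </result>
  </results>
</sparql>"""

links, rows = parse_results(SAMPLE)
print(links)
print(rows)
```

The "link" entries are where a reference to the privacy/usage policy would be
expected to appear in the results.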


11. TODOs
---------