\chapter{System Detail}

In this chapter, we discuss the implementation details of the policy assurance system. Each section details a particular component of the system, discussing the design assumptions, choices, and tradeoffs that were necessary. This chapter serves as full and complete documentation of this project for future users and developers of the system.

This section benefits from a working knowledge of RDF, N3, AIR, SPARQL, Python, and C++. The code to support this discussion is included in the appendix.

\section{SPARQL Query Translation}

The policies that we defined in AIR in the previous section require SPARQL queries to be in a particular N3 syntax in order to correctly perform reasoning. The design of the AIR policy and the SPARQL query translation go hand in hand. The first approach at query reasoning used a SPARQL translation that maintained the semantics of the SPARQL query, in fulfillment of the Phase we milestone for this project. In time, we moved to a translation that removes many of the semantics of SPARQL queries. We discuss the details of the new translation herein.

\subsection{SPARQL to N3 Web Page}

\begin{figure}
\centering
\includegraphics[scale=0.5]{translator}
% Make sure to include a screenshot of the translator.
\caption{A screenshot of the SPARQL to N3 translator Web page.}
\label{translator}
\end{figure}

The SPARQL to N3 Web page serves as a front-end to the conversion process discussed in the previous section. The user inputs a well-formed SPARQL query into the form on the page. Upon clicking submit, the page's backend script will pass the query to the translator, a command-line based utility. The translator then returns an N3 translation of the user's SPARQL query to the Web page.

A user may use the SPARQL to N3 translator directly to help debug the policy process, or to gain understanding of the query process. It is important to note that the N3 translation does not preserve all SPARQL semantics, nor does it preserve order in some cases. In the majority of use cases, the user will interact indirectly with the translator.

\subsection{Query Conversion Ontology}

We define an ontology for N3 conversion of generic queries in table \ref{q-ont}. All of these names are defined in the query conversion namespace, which is:

\begin{Verbatim}
http://dig.csail.mit.edu/2009/policy-assurance/sparql.n3#
\end{Verbatim}

\begin{table}
\centering
	\begin{singlespace}
	\begin{itemize}
	\item \Verb|:SPARQLQuery|. \emph{Refers to a query converted to N3.}
		\begin{itemize}
		\item \Verb|:source|. \emph{A data source for the query. In SPARQL, this can be FROM or FROM NAMED. A \Verb|:SPARQLQuery| may have many of these.}
		\item \Verb|:retrieve|. \emph{A list of variables to which the \Verb|:SPARQLQuery| refers. A \Verb|:SPARQLQuery| has exactly one of these.}
			\begin{itemize}
			\item \Verb|:var|. \emph{A member of the \Verb|:retrieve| which describes a particular variable.}
			\end{itemize}
		\item \Verb|:clause|. \emph{The contents of the \Verb|:SPARQLQuery|. This corresponds to the WHERE clause in SPARQL. A \Verb|:SPARQLQuery| has exactly one of these.}
			\begin{itemize}
			\item \Verb|:triplePattern|. \emph{A particular line or binding in a \Verb|:SPARQLQuery|.}
			\end{itemize}
		\end{itemize}
	\end{itemize}
	\end{singlespace}
\caption{Query conversion ontology.}
\label{q-ont}
\end{table}

\begin{figure}
\centering
\includegraphics[scale=0.5]{Nice_Abstract_Sparql_Diagram}
\caption{Ontology diagram of the SPARQL translation, courtesy of Yotam Aron.}
\label{abstract-query}
\end{figure}

The ontology is intentionally small, short, and simple. Whereas the first iteration of query translation focused to have a bidirectional, complete N3 translation of a SPARQL query, the goal of this ontology is to facilitate the conversion of queries in additional languages, and to contain the minimum amount of information necessary so as to continue to be useful in making assertions according to our current policies. (An important future work of this project, and perhaps the largest hurdle to widespread adoption, is the support of SQL queries.)

The next subsections will demonstrate exactly how we convert a query.

\subsection{swobjects: Parsing and Serializing}

The current implementation of the SPARQL to N3 translator is an adaptation of Eric Prud'Hommeaux's swobjects code\cite{swobjects}. swobjects, per its homepage, is a suite of tools written in C++ for performing operations upon Semantic Web objects. swobjects is an important part of the SPASQL project, a project whose goal is to facilitate data integration from multiple databases\cite{spasql}.

This project modifies the serialization engine of swobjects. The swobjects code, in one mode of operation, will accept a SPARQL query as input, and print the same SPARQL query as output. The swobjects code performs parsing and lexing of the input, and creates a parse tree that it expresses. By overloading the classes of the SPARQLSerializer code in swobjects, we were able to modify the output of a SPARQL query. By carefully defining the output and performing a small amount of additional processing, we are able to perform conversion.

The C++ code that overloads the SPARQLSerializer is located in the DIG repository at

\begin{Verbatim}
http://dig.csail.mit.edu/2009/policy-assurance/sparql2n3.cpp
\end{Verbatim}

\subsection{SPARQL Language Translation}

In this section, we discuss the details of translating a SPARQL query to N3. The discussion closely follows the W3C Recommendation for SPARQL \cite{sparql}, and uses some queries from the DAWG test case\cite{dawg-test}. We discuss which features we support, and how we support them. We explain which features of the language we do not support, and why we chose the approach that we did in a few situations. The translation is incomplete; in particular, the GRAPH feature does not work, nested queries do not work, and the FILTER keyword is only partially supported. Furthermore, whitespace generation needs improvement. Nevertheless, this translation is suitable for the majority of simple SPARQL queries.

As of this writing, the query translator is live. To use it, please visit:

\begin{Verbatim}
http://dig.csail.mit.edu/2009/policy-assurance/sparql2n3.py
\end{Verbatim}

\subsubsection{Namespaces and URIs: BASE and PREFIX}

A SPARQL query may specify its own namespace using the @prefix keyword, or the @base keyword:

\begin{Verbatim}
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
\end{Verbatim}

A namespace is shorthand, allowing truncation of URIs. For lack of ambiguity, we kept the swobjects feature of expanding or flattening all URIs. The output of the translation only specifies one name space: the namespace of the translation ontology.

\subsubsection{Query Identification}

Every query that goes through the translator is assigned a unique name. In the current implementation, we cannot be assured of the uniqueness of the name, but as it is assigned by a random number plus the current time stamp (seconds since UNIX epoch), the probability of 1 in $2^{32}$ of a collision in one second is acceptably low. A one-way hash of a query does not work, as this causes successive translations of the same query to have the same name. It is conceivable that a user might run the same query, or a series of very similar queries, an arbitrary number of times. Our policies depend on uniquely named policies to correctly determine which policies are in compliance or violation of a policy.

\subsubsection{SELECT}

A SPARQL query with a SELECT clause ``returns all, or a subset of, the variables bound in a query pattern match.'' Thus, a query with a SELECT clause has a variable part and a triple pattern part.

If we input the query:

\begin{singlespace}
\begin{Verbatim}
# Sample SPARQL query that queries for SSN, age, OpenID.
PREFIX example: <http://example.com/#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?n WHERE {
     ?s example:ssn ?n. ?s foaf:age ?a. ?s foaf:openid ?id.}
\end{Verbatim}
\end{singlespace}

The translator will generate the N3 output:

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query9743408551250696473 a s:SPARQLQuery;

s:clause [
  s:triplePattern  { :s <http://example.com/#ssn> :n };
  s:triplePattern  { :s <http://xmlns.com/foaf/0.1/age> :a };
  s:triplePattern  { :s <http://xmlns.com/foaf/0.1/openid> :id };

]; 
   s:retrieve [
      s:var :n;
      s:var :s;
].
\end{Verbatim}
\end{singlespace}

\subsubsection{SELECT *}

The syntax SELECT * is an abbreviation that selects all of the variables in a query\cite{sparql}. To implement this, the translator adopts a special behavior when it sees SELECT *. It will maintain a set of all variables mentioned in the body of the query. At the end of the clause, after processing is complete, the translator will print a list of every variable seen. In effect, this implements the correct handling of SELECT *, in line with the ``flattening'' philosophy of the translator.

Given this example query, with a SELECT *,

\begin{singlespace}
\begin{Verbatim}
# Sample SPARQL query that queries for SSN, age, OpenID.
PREFIX example: <http://example.com/#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
     ?s example:ssn ?n. ?s foaf:age ?a. ?s foaf:openid ?id.}
\end{Verbatim}
\end{singlespace}

The translator will generate the N3 output:

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query14924979641250696494 a s:SPARQLQuery;

s:clause [
  s:triplePattern  { :s <http://example.com/#ssn> :n };
  s:triplePattern  { :s <http://xmlns.com/foaf/0.1/age> :a };
  s:triplePattern  { :s <http://xmlns.com/foaf/0.1/openid> :id };

]; 
   s:retrieve [
      s:var :a;
      s:var :id;
      s:var :n;
      s:var :s;
].
\end{Verbatim}
\end{singlespace}

\subsubsection{CONSTRUCT}

A CONSTRUCT query is very similar to a SELECT, except in that it creates a new RDF graph with its output. In effect, it is a union of all of the solution sets. Thus, the useful output of a CONSTRUCT query is the same as a similar SELECT query, in that it will only output explicit and not blank variables. The translator handles variables in a CONSTRUCT similarly to a SELECT *: it will output every non-blank variable mentioned in the CONSTRUCT part of the clause.

Given this example query, with a CONSTRUCT,

\begin{singlespace}
\begin{Verbatim}
CONSTRUCT { ?s <p1> <o> . ?s <p2> ?o } WHERE {?s ?p ?o}
\end{Verbatim}
\end{singlespace}

The translator will generate the N3 output:

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query13731259441250696648 a s:SPARQLQuery;
    s:clause [
  s:triplePattern  { :s :p :o };
]; 
   s:retrieve [
      s:var :o;
      s:var :s;
].
\end{Verbatim}
\end{singlespace}

Note that the variable list is always placed \emph{after} the clause, due to the current implementation of the translator. The reasoner will accept the variable list before or after the clause without loss of functionality.

\subsubsection{ASK}

In an ASK query, the user may ``test whether or not a query pattern has a solution. No information is returned about the possible query solutions, just whether or not a solution exists.'' \cite{sparql} Thus, no variables are bound or output in an ASK clause. Of course, an ASK query may still explicitly refer to a particular field or data type which we may wish to regulate, so it is important to capture that pattern in the output.

Given this example query, with an ASK,

\begin{singlespace}
\begin{Verbatim}
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
ASK  { ?x foaf:name  "Alice" }
\end{Verbatim}
\end{singlespace}

The translator will generate the N3 output:

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query5293946291250696665 a s:SPARQLQuery;
s:clause [
  s:triplePattern  { :x <http://xmlns.com/foaf/0.1/name> "Alice"  };

].
\end{Verbatim}
\end{singlespace}

This is sensible, because even though we don't directly output any data from the database, we are asking about ``Alice'' and learning something about a foaf:name.

\subsubsection{DESCRIBE}

The SPARQL DESCRIBE construct is implementation dependent. A particular SPARQL endpoint may implement DESCRIBE as it sees fit. From a translation and policy standpoint, it is possible that DESCRIBE query will return information about the variables and triple patterns it describes. Thus, we treat DESCRIBE as we would a SELECT.

Given this example query, with a DESCRIBE,

\begin{singlespace}
\begin{Verbatim}
DESCRIBE <u> ?u WHERE { <x> <q> ?u . }
\end{Verbatim}
\end{singlespace}

The translator will generate the N3 output:

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query2577564471250696683 a s:SPARQLQuery;
<u>s:clause [
  s:triplePattern  { <x> <q> :u };

]; 
   s:retrieve [
      s:var :u;
].
\end{Verbatim}
\end{singlespace}

\subsubsection{Query Modifiers: ORDER BY, LIMIT, OFFSET, DISTINCT, REDUCED}

SPARQL supports a number of query modifiers that alter the output. ORDER BY serves to sort the output by its argument. LIMIT reduces the number of patterns that a query may bind. OFFSET works with ORDER BY and LIMIT to select a subset of the full output. DISTINCT and REDUCED change how a SPARQL query will handle duplicates.

These query modifiers form important parts of SPARQL's semantics. However, from a policy standpoint, their results are highly data dependent. Since the translation must be as pessimistic as possible, it simply drops these keywords from the output.

\subsubsection{OPTIONAL}

The OPTIONAL keyword serves to make a query pattern conditional. Normally, individual query triple patterns are mandatory, and all of them must be satisfied if they chain together. An OPTIONAL pattern is not mandatory. It will return results if they are present in the store, but will not fail if there is no data to support it.

Since our translator must take a pessimistic stance, we simply drop the OPTIONAL keyword from the translation. Our policies must be able to detect if there is \emph{any possibility} that a triple pattern will result in a binding. It is unimportant whether or not the pattern is mandatory if we assume that it will have a binding.

\subsubsection{UNION}

The UNION keyword allows the user to specify alternatives to a query pattern binding. This allows multiple results to be concatenated. The translator simply drops the UNION keyword, since policies do not need to use its semantics.

Given this example query, with a UNION,

\begin{singlespace}
\begin{Verbatim}
PREFIX dc10:  <http://example/1.0/>
PREFIX dc11:  <http://example/1.1/>

SELECT ?title
WHERE  { { ?book dc10:title  ?title }
         UNION
         { ?book dc11:title  ?title } }
\end{Verbatim}
\end{singlespace}

The translator will generate the N3 output:

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query15319361881250696825 a s:SPARQLQuery;

s:clause [
    s:triplePattern  { :book <http://example/1.0/title> :title };
    s:triplePattern  { :book <http://example/1.1/title> :title };


]; 
   s:retrieve [
      s:var :title;
].
\end{Verbatim}
\end{singlespace}

\subsubsection{Boolean Functions}

SPARQL FILTERs support a number of boolean operators. The translation changes those operators into terms defined in its namespace. This allows the writing of policies that can detect these operators. A future work is to declare these as equal to other operators in other ontologies. The current translation is in table \ref{bools}.

\begin{table}
\centering
\begin{tabular}{|c|c|c|}
\hline
Operator & Name & Translation \\
\hline
\Verb|-| & arithmeticNegation & :arithmeticNegation \\
\Verb|/| & arithmeticInverse & :arithmeticInverse \\
\Verb|+| & arithmeticSum & :arithmeticSum \\
\Verb|*| & arithmeticProduct & :arithmeticProduct \\
\Verb|&| & booleanConjunction & :booleanAND \\
\Verb+|+ & booleanDisjunction & :booleanOR \\
\Verb|!| & booleanNegation & :booleanNOT \\
\Verb|=| & booleanEQ & :booleanEQ \\
\Verb|!=| & booleanNE & :booleanNE \\
\Verb|<| & booleanLT & :booleanLT \\
\Verb|>| & booleanGT & :booleanGT \\
\Verb|<=| & booleanLE & :booleanLE \\
\Verb|>=| & booleanGE & :booleanGE \\
\hline
\end{tabular}% Make sure to include a screenshot of the translator.
\caption{Conversion of Boolean operators. The ``name'' refers to the title of the operator in the translator code. The ``translation'' is what the translator outputs, in the query translation namespace.}
\label{bools}
\end{table}

\subsubsection{Built-in Functions: STR, LANG, LANGMATCHES, DATATYPE, BOUND, sameTERM, isURI, isIRI, isLITERAL, REGEX, true, false}

SPARQL supports a number of built in functions for query processing. Their primary use is to serve as arguments to the FILTER keyword. Our translation does not support these built in functions at present.

\subsubsection{FILTER}

The FILTER keyword in SPARQL allows the user to restrict a query's output based on a condition. The condition may be a triple pattern, which we fully support and exemplify below. Alternately, the condition may include a built-in function, which we do not support; the most popular form if this is FILTER REGEX. In terms of writing policies, a FILTER counts as a ``use'' of a policy. In the case of a triple pattern, our translator drops the FILTER keyword, as its semantics are unimportant to the translator, and includes the triple pattern from the FILTER.

Given this example query, with a FILTER,

\begin{singlespace}
\begin{Verbatim}
SELECT * WHERE { FILTER (?o>5) . ?s ?p ?o } 
\end{Verbatim}
\end{singlespace}

The translator will generate the N3 output:

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query10909264011250696848 a s:SPARQLQuery;

s:clause [
  s:triplePattern  { :s :p :o };
  s:triplePattern  { :o s:booleanGT "5 "};

]; 
   s:retrieve [
      s:var :o;
      s:var :p;
      s:var :s;
].
\end{Verbatim}
\end{singlespace}

\subsubsection{GRAPH}

The GRAPH keyword restricts a particular pattern to only apply to unnamed (using the FROM keyword) or named (using the FROM NAMED keyword) graphs. If used in conjunction with a named graph, the result will note which graph contained the matching result. At present, the translation does not support the GRAPH keyword. It requires some further consideration, as it is difficult to express and complicates a query.

One possibility is to simply drop it; this would easily support all of the semantics of an unnamed graph query, and allow us to write policies that simply check which graphs we include with FROM or FROM NAMED. Another, more sophisticated, approach would be to generate several equivalent queries, in effect flattening the GRAPH keyword. With multiple uses of GRAPH in a query, this could easily grow exponentially.

\subsection{Lost in Translation}

As mentioned earlier, our original translation maintained more SPARQL semantics. This approach is deprecated in favor of the approach described here, but for completeness, we include a SPARQL query with translations under the old and new translations.

Given this SPARQL query as input,

\begin{singlespace}
\begin{Verbatim}
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX type: <http://dig.csail.mit.edu/2009/IARPA-PIR/generic#>
SELECT ?s ?b ?c WHERE {
  ?s type:b ?b.
  ?s type:c ?c.
}
\end{Verbatim}
\end{singlespace}

the first version of the translation would have yielded this output,

\begin{singlespace}
\begin{Verbatim}
@prefix type: <http://dig.csail.mit.edu/2009/IARPA-PIR/generic#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix math: <http://www.w3.org/2000/10/swap/math#>.
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .
@prefix : <http://dig.csail.mit.edu/2009/IARPA-PIR/query1#> .

:Query-1 a s:Select;
    s:cardinality :ALL;
    s:POSList [
       s:variable :S;
       s:variable :B;
       s:variable :C;
     ];
    s:WhereClause  :WHERE.

    :WHERE a s:DefaultGraphPattern;
         s:TriplePattern  { :S type:B :B };
         s:TriplePattern  { :S type:C :C }.
        
#ends
\end{Verbatim}
\end{singlespace}

whereas the current version of the translator yields this output.

\begin{singlespace}
\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query6088797851250696900 a s:SPARQLQuery;

s:clause [
  s:triplePattern  { :s <http://dig.csail.mit.edu/2009/
                     IARPA-PIR/generic#b> :b };
  s:triplePattern  { :s <http://dig.csail.mit.edu/2009/
                     IARPA-PIR/generic#c> :c };

]; 
   s:retrieve [
      s:var :b;
      s:var :c;
      s:var :s;
].
\end{Verbatim}
\end{singlespace}

The first version closely preserved most, if not all, of the SPARQL semantics. We found this approach to be cumbersome, as it required writing brittle policies that closely followed SPARQL query semantics. The old approach did not scale well, and did not lend itself to automated policy generation.

\subsection{Translator Summary}

The translator presented herein is novel, in that we know of no prior attempt to express a SPARQL query in N3 for any reason. The implementation of the translator offers insight as to exactly what parts of an arbitrary query give it structure and permit reasoning. We found that most of the SPARQL semantics simply led to more complicated policies, and were redundant.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



\section{AIR Policy Generation}

\subsection{Templates for Policy Generation}

The goal of this project is to make policy assurance accessible to a user with an intermediate level of technical understanding, without a thorough grounding in Semantic Web technologies. To this end, this project offers a number of templates that allow a user to directly create policies according to several fixed design patterns. With a list of attributes as an input, the policy generator will create a valid AIR policy to facilitate reasoning.

All of these templates work on the level of single attributes. Furthermore, the policies that the templates generate rely on matching very particular triple patterns. Though users can represent many policies with a combination of one or more templates, there are policies that the templates cannot describe. These patterns cover some basic cases, and provide a starting point for more advanced policies. A description of some of the common design patterns follows. An advanced, AIR-savvy user is still able to create their own policies.

\subsection{Supported Policy Types}

At this time, this project defines five separate kinds of policies. These policies are designed to be templates, to help an administrator create or implement more sophisticated policies. These policies do not necessarily exhaust the space of policies that this system can represent, but serve as a basis for many policies. Each policy works by matching a particular variable type or attribute, and taking action when it finds that type. The five policies, in no particular order, perform as described on variable types that a user enters.

\begin{itemize}
\item \textbf{Restriction}. Blocks access to a variable type.
\item \textbf{Inclusion}. Lumps variable types together, so a user must access either all of them or none of them.
\item \textbf{Exclusion}. Allows a user access to all but one of the specified variable types.
\item \textbf{Chaining}. Accepts a list of variable types. If the user mentions the first variable type, chaining takes action if it sees any of the following variable types.
\item \textbf{Default Deny}. Restricts access to the specified variable types.
\end{itemize}

The following sections describe the implementation details of each of these policies.

\subsubsection{To USE and to RETRIEVE}

All of the policy templates in this section implement patterns which detect two common usage patterns, \Verb|USE| and \Verb|RETRIEVE|. To \Verb|USE| an attribute is to pull it from a database and make decisions based on its value. To \Verb|RETRIEVE| an attribute is to pull it from a database and display it directly to the user. It is possible to \Verb|USE| an attribute without a \Verb|RETRIEVE|, and vice versa.

When a user submits a query to \Verb|USE| a variable, the policy detects this by searching for a particular triple pattern:

\begin{Verbatim}
:W s:triplePattern :T;
:T log:includes { [] example:age :V };
:W s:triplePattern :U;
:U log:includes { :V [] [] }.
\end{Verbatim}

The policy is looking for a pair of triple patterns to detect usage. In the first triple pattern, a variable is the object of a triple pattern, with an attribute as the predicate. In the second triple pattern, the variable that was the object of the first pattern is now the subject. This ascertains that a user bound a variable to a certain attribute, and is trying to use that same variable for further processing. As an example, the following query would match this pattern.

\begin{Verbatim}
# Sample SPARQL query that queries for SSN, age, OpenID.
PREFIX example: <http://example.com/#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
     ?s example:ssn ?n. ?s foaf:age ?a. ?s foaf:openid ?id.
     FILTER (?a > 18) }
\end{Verbatim}

This query, translated into N3 with the aforementioned converter, appears as follows.

\begin{Verbatim}
@prefix s: <http://dig.csail.mit.edu/2009/IARPA-PIR/sparql#> .

:Query-1644923176 a s:SPARQLQuery;
   s:retrieve [
   ];
s:clause [
  s:triplePattern  { :s <http://example.com/#ssn> :n };
  s:triplePattern  { :s <http://xmlns.com/foaf/0.1/age> :a };
  s:triplePattern  { :s <http://xmlns.com/foaf/0.1/openid> :id };
  s:triplePattern  { :a s:booleanGT "18 "};

]; 
   s:retrieve [
      s:var :a;
      s:var :id;
      s:var :n;
      s:var :s;
].
\end{Verbatim}

In particular, the \Verb|USE| policy would find two triple patterns of interest, indicating that the user's query is trying to \Verb|USE| something with an attribute of \Verb|foaf:age|:

\begin{Verbatim}
s:triplePattern  { :s <http://xmlns.com/foaf/0.1/age> :a };
s:triplePattern  { :a s:booleanGT "18 "};
\end{Verbatim}

A user may \Verb|RETRIEVE| an attribute independently of trying to \Verb|USE| it. When trying to \Verb|RETRIEVE|, a user will submit a query that binds an output variable to a particular attribute. In other words, the user submits a SPARQL query for verification, with a \{subject, predicate, object\} triple in which the predicate is a type (specifically, a URI, like  \Verb|example:dob|), and the object is a variable, like \Verb|:V|. The reasoner uses this information to decide, ``\Verb|:V| is a variable that will match things of type \Verb|example:dob|.'' The policy looks at variables in the context of attributes in the database, instead of the actual, bound values within the database itself, since our policies never look inside the database.

 The following code demonstrates an AIR pattern that will catch a user trying to \Verb|RETRIEVE| an attribute with a type \Verb|example:dob|.

\begin{Verbatim}
:P s:var :V;
:W s:triplePattern :T;
:T log:includes { [] example:dob :V }
\end{Verbatim}

To \Verb|RETRIEVE| an attribute, a user must explicitly \Verb|SELECT| a variable (or \Verb|SELECT *|), and bind it in a triple pattern as the object, with the attribute as the subject. In the aforementioned query, a user's attempt to \Verb|RETRIEVE| an attribute of type \Verb|example:ssn| is evident with these N3 statements.

\begin{Verbatim}
s:retrieve [
      s:var :n; ]
s:clause [
  s:triplePattern  { :s <http://example.com/#ssn> :n }; ]
\end{Verbatim}

The important distinction between \Verb|RETRIEVE| and \Verb|USE| is that the former allows a user to access data in a data set directly, while the latter does not. Once a user has seen a particular piece of data, the policy assurance framework considers it to be ``public''. There may be incentives in place for the user to control data retrieved from a data set, but those incentives are external to the policies.

The distinction between \Verb|USE| and \Verb|RETRIEVE| is intended for honest, but curious, users, to help keep them honest. An attacker could easily write a series of \Verb|USE| policies that return data that is equivalent to a \Verb|RETRIEVE|. For example, the author's calculations based on publicly available Social Security Administration data shows that there are 746,347,500 possible Social Security numbers. If a policy prevents a user from directly seeing a Social Security number in a database, the user could simply exhaust the SSN space using \Verb|USE| queries. An administrator can specify policies that limit the number of times a user can access a particular attribute. A true security approach with a malicious user in mind is an important future work, well outside the scope of this thesis.

With these common access patterns defined, it is possible to write template policies that can flag these accesses.

The discussion of sample policy templates uses possible fields from a database of personal information. Specifically, it uses \Verb|example:name| for a person's name, \Verb|example:age| for a person's date of birth, \Verb|example:dob| for a person's date of birth, and \Verb|example:ssn| for a person's social security number. In practice, a system of real-world policies would require multiple AIR policies; an important future work would help a user with this task.

The template policies are restriction, inclusion, exclusion, history-aware exclusion, chaining, and default deny. A description of each template policy follows. Samples of all of these policies are available in the appendix. Furthermore, tables which demonstrate how these ideas become policies appear at \ref{translator-table-use} and \ref{translator-table-retrieve}. \endnote{You may also rotate your head or your entire computer to see the tables in their correct orientation. The author is not responsible for any damage from this process!}

\subsubsection{Restriction}

A restriction policy disallows a user from performing a USE or RETRIEVE operation (or both) concerning any data with a particular attribute. The policy implicitly implies that a user can see all of the data in a data set, with the exception of the restrictions put in place by a restriction policy. Restriction is most helpful for controlling the way a user interacts with data of a particular attribute.

The sample restriction policy in appendix section \ref{sample-restriction} states that a user:

\begin{itemize}
\item May not \Verb|RETRIEVE| any attribute of type \Verb|example:name|.
\item May not \Verb|USE| any attribute of type \Verb|example:age|.
\item May not \Verb|RETRIEVE| any attribute of type \Verb|example:dob|.
\item May not \Verb|USE| any attribute of type \Verb|example:ssn|.
\end{itemize}

A user may, for example, \Verb|USE| an attribute of type \Verb|example:name|, since it is a pattern that the policy does not mention.

The policy runner will assert that the user is not in compliance with this policy if it finds any one of those usage patterns in a query. It will return the decision, with an explanation. The policy runner does not evaluate patterns in any particular order. In practice, a restriction policy would be most useful with a default deny policy in place.

\subsubsection{Inclusion}

An inclusion policy requires that a user tie together certain actions. Inclusion implements an all or none policy that is only available using two or more attributes. An administrator might use this policy to ensure that two fields in a data set are always coupled.

The sample restriction policy in appendix section \ref{sample-inclusion} states that a user's query is compliant if the query will:

\begin{itemize}
\item \Verb|RETRIEVE| any attribute of type \Verb|example:name|, AND
\item \Verb|USE| any attribute of type \Verb|example:age|, AND
\item \Verb|RETRIEVE| any attribute of type \Verb|example:dob|, AND
\item \Verb|USE| any attribute of type \Verb|example:ssn|.
\end{itemize}

If a query contains any subset of these patterns, but not all of them, the query is incompliant. If the query does not contain any of these patterns, it is compliant.

In practice, an inclusion policy likely has little use. A policy that requires a mandatory filter, such as an age filter, will use chaining and not inclusion.

\subsubsection{Exclusion}

An exclusion policy is, in effect, the inverse of an inclusion policy. It allows an administrator to tie together a number of patterns. Under an exclusion policy, a user's query (or query history) may contain ``all but one'' of the patterns in the policy.

The sample exclusion policy in appendix section \ref{sample-exclusion} states that a user's query is not compliant if the query will:

\begin{enumerate}
\item \Verb|RETRIEVE| any attribute of type \Verb|example:name|, AND
\item \Verb|USE| any attribute of type \Verb|example:age|, AND
\item \Verb|RETRIEVE| any attribute of type \Verb|example:dob|, AND
\item \Verb|USE| any attribute of type \Verb|example:ssn|.
\end{enumerate}

A user's query may access all but one of these patterns with an AND relation, or none of them at all. As an example, the user can RETRIEVE \Verb|example:name| \emph{and} RETRIEVE \Verb|example:dob| \emph{and} USE \Verb|example:ssn|, either in one query or in successive queries if history is enabled. If the user does this, they will not be able to USE \Verb|example:age|. Exhaustively, if we use the notation $\{a \bigwedge b \bigwedge \ldots\}$ to express a user trying to access attributes from the above list, the user may NOT do $\{1 \bigwedge 2 \bigwedge 3 \bigwedge 4\}$, but may do any subset of that, including $\{\}$ (nothing).

Exclusion is the only basic policy construct for which it makes sense to use a history. The sample policy in appendix section \ref{sample-exclusion-history} demonstrates how a policy needs to change in order to support query history. In a break from all other policies, a history-aware policy needs to have an idea of which is the current query, and which are the past queries. This is important, since AIR, a production rule system, does not support any concept or order or counting. A history-aware policy will assert non-compliance if it matches all of the patterns in the policy, regardless of whether those patterns are contained solely in the history, solely in the current query, or in some combination of current and past queries. A full explanation of query history follows in the next section.

Originally, this project sought to include a $MAX(m,n)$ construct. Exclusion implements a $MAX(n-1,n)$ policy, where a user can see at most $n-1$ out of $n$ patterns. Implementing the more general $MAX(m,n)$ for $m \leq n$ and $n>0$ involves combinatorics and does not scale well. Finding an efficient, scalable implementaiton of general $MAX(m,n)$ is an important future work.

\subsubsection{Chaining}

The chaining policy allows an administrator to implement policies in the form IF \ldots THEN \ldots; to detect the presence of a pattern and determine compliance or non compliance based on the presence of future patterns. A chaining policy can either be default compliant (in which case a pattern match asserts non-compliance), or default non-compliant. The code for the chaining policy generator supports \Verb|RETRIEVE| and \Verb|USE|, as well as filter patterns.

The sample chaining policy in appendix section \ref{sample-chaining}, a default non-compliant policy, will try to match the first pattern. If it succeeds, it will make an assertion if it finds any of the successive patterns, and will go to the default if it finds none of them. It walks though the patterns one after the other, as links in a chain. The sample policy states:

\begin{itemize}
\item If a user wishes to \Verb|RETRIEVE| any attribute of type \Verb|example:name|, the user must
	\begin{itemize}
	\item \Verb|USE| an attribute of type \Verb|example:age|, OR
	\item \Verb|RETRIEVE| an attribute of type \Verb|example:dob|, OR
	\item \Verb|USE| an attribute of type \Verb|example:ssn|, OR
	\item Include a filter of the form \Verb|example:age| $> 18$.
	\end{itemize}
\end{itemize}

More generally, chaining specifies a policy of the form ``if A matches, take action if B or C or \ldots is found.'' A, B, C, \ldots can be either a USE or RETRIVE, and ``take action'' is either ``assert compliance'' or ``assert noncompliance''. To match a pattern that is not some form of USE or RETRIEVE, the administrator would need to code this manually; there is no facility in the user interface for the input of a pattern. Furthermore, since the reasoner only looks at the queries to a system, and not their results, making policies that fire if certain data is in the database is not possible. As an example, consider a policy that says, ``you may not RETRIEVE any attribute of type \Verb|example:name| if the value of name is John.'' We can't check to see that a name has value John, so we can't use exclusion here. What we might do instead is to create a policy that requires a user to specify ``\Verb|example:name| $!=$ John'' in every query they submit.

Chaining is helpful when an administrator would like to provisionally provide access to a certain attribute. With a chaining policy, the administrator can require a particular filter or other pattern to be present in order for compliance. Since this works at the granularity of a single query, using a history does not make sense for this policy!

\subsubsection{Default Deny}

The last of the primitives for automatic policy generation, default deny allows an administrator to impose limitations on all queries. Whereas all of the previous policies made no assumption about the size of the world by assuming an open world, default deny imposes strong restrictions by closing the space of possible queries to only match a few particular patterns. This is analogous to the \Verb|GRANT| feature of SQL databases, which explicitly enables permissions to a particular field. This policy family is called ``default deny'' because any pattern that it does not expressly enumerate is, by default, not compliant with the policy.

The sample default deny policy in appendix section \ref{sample-defaultdeny} states that a user's query is compliant only if every single triple pattern in its translation matches one of the following patterns, where $[\hspace{1em} ]$ is a wildcard that can match anything:

\begin{itemize}
\item $[\hspace{1em} ]$ \Verb|example:name| $[\hspace{1em} ]$
\item $[\hspace{1em} ]$ \Verb|example:age| $[\hspace{1em} ]$
\item $[\hspace{1em} ]$ \Verb|example:dob| $[\hspace{1em} ]$
\item $[\hspace{1em} ]$ \Verb|example:ssn| $[\hspace{1em} ]$
\end{itemize}

Implicitly, this policy allows a user to \Verb|RETRIEVE| or \Verb|USE| any of these attributes, to use filters, or to peform any operation that specifically matches one of those four patterns.

In practice, a default deny policy can set the ground rules for access to a particular data set. By explicitly giving permissions to certain triple patterns, the administrator can guarantee closure on the policies. In conjunction with the other template policies, an administrator can offer a level of access between `all' and `none'.

% The table to illustrate policy generation is complicated.
% Put it in another file and input it here.

\subsection{Automatic Policy Generation} % later

The table that follows demonstrates how the aforementioned policy templates generate policies based on user input. The purpose of the table is to show how the generation of all policies is similar, and how each policy scales with multiple variables. The author prepared these tables to provide a reference for implementing each policy template.

There are two tables: one for USE policies, and one for RETRIEVE policies. Along the top row of each table are the five policy types. Each table is divided in half: the top half of each table demonstrates what happens if a user enters a single variable type or attribute (say \Verb|type:A|), and the bottom half shows what happens if a user enters two or more variable types or attributes (say \Verb|type:A|, \Verb|type:B|, and possible \Verb|type:C|). The three rows - AIR Rule, Description, Action - explain how a policy gets generated, and what it does. The AIR Rule(s) row shows the snippet of AIR code that forms the important part of the policy. Some policies require several AIR rules to express their functionality, particularly those with multiple variables. The table illustrates AIR code that would appear in separate rules by separating it with a blank line (two newlines). In reality, each policy contains a large amount of boilerplate code, but this snippet really defines the policy. The Description row offerse a terse explanation of the effect of the policy, and the Action row shows what action the reasoner takes when it matches the code in the Rule row. All of these policies depend on a boilerplate that defines \Verb|:Q| as a \Verb|s:SPARQLQuery|, with a \Verb|s:retrieve| part \Verb|:P| and a \Verb|s:clause| part \Verb|:W|.

As an example, consider a policy where an administrator wants to create an Inclusion policy, which specifies that a user must either USE \Verb|type:A| and \Verb|type:B| together in a query, or not at all. The chart demonstrates how the policy generator will create this policy. This is a USE policy, so the template is listed in the USE table. It is an Inclusion policy, which is the third column from the left. This policy mentions multiple variables, so the information that describes it is in the bottom half of the table. In particular, three cells describe the AIR Rule(s), the Description, and the Action of this policy. The AIR Rule(s) row tells us that this policy will have three separate AIR rules that perform pattern matching: one that looks for both \Verb|type:A| and \Verb|type:B| used together, and one each looking for \Verb|type:A| or \Verb|type:B|. The Description row tells us what the policy tries to accomplish, and the Action row describes the action the reasoner takes upon a match between the policy and the query.

Some policies support a ``hybrid'', whereby a policy can act on a query that tries to USE some attribute(s) and tries to RETRIEVE some other attribute(s). These cases are listed in the Notes row of the USE table. Some policy types do not make sense, so the table marks them as unsupported.

\input{translator-table.tex}

\subsection{Query History with check-compliance}

This project supports query histories. The history is a record of every query that a user has made, that can be used when checking for compliance with a query. In fact, exclusion is the only policy for which it makes sense to use the query history; moreover, exclusion should always use the query history. It is important to note that the query history is not always helpful. Restriction never needs it, since single queries simply fail. Inclusion does not need it either, since one query must always contain all of the terms. Chaining should leave history disabled, since having used a condition in the past doesn't guarantee that the current query is compliant. Default deny never needs it, since its behavior is so similar to restriction.

We implement query history as follows.

\begin{enumerate}

\item User calls ``check-compliance POLICY NEW-QUERY QUERY-DESC'' wrapper script. POLICY is an AIR policy; NEW-QUERY is a query in n3, perhaps the output of the sparql2n3 script, and QUERY-DESC is a file like query-req.n3 that defines the location of the query and history files.
\item Wrapper script appends NEW-QUERY to the history file in QUERY-DESC, say, history.n3. This accomplishes three things:
\begin{itemize}
    \item we only need to look at QUERY-HISTORY when reasoning, making it much easier to automatically generate policies that only grow linearly with the number of terms.
   \item AIR will explicitly know which query is the current (new) query.
   \item This guarantees that the QUERY-HISTORY is never empty, eliminating a tricky edge case.
\end{itemize}
\item Wrapper script writes NEW-QUERY to a file, say, query.n3.
\item Wrapper script calls policyrunner.py, and returns the output of policyrunner.py for viewing.
\item Optional. If policy is non compliant, we can remove it from the history file.

\end{enumerate}

This approach is sufficient, and preserves correct AIR semantics of the query and the history.

\subsection{Policy Generation User Interface}

\begin{figure}
\centering
\includegraphics[scale=0.5]{ui-firstdemo}
\caption{A screenshot of the policy generator, courtest of Yotam Aron.}
\label{demo-ui01}
\end{figure}

The template policies defined above allow a user to easily generate policies. In order to do so, a user interacts with the policy generator. Depending on the policy, a user inputs variables to catch, whether to act on a USE or a RETRIEVE, a policy name, a policy description, and if necessary, whether or not the query ought to be history enable.

\subsection{Compliance Testing and Browser Presentation in Tabulator}

Tabulator is a Firefox browser plugin that enhances Semantic Web data. In this project, it performs two functions. One, it organizes the display of a policy and the of the policy runner, in a form that is readable hy a human.  Figure \ref{query-ui01} demonstrates this functionality. Tabulator also makes it easier for a user to understand the reasoning generated, as figure \ref{query-ui02} demonstrates.

The majority of the features discussed in this section are the work of Oshani Seneviratne, as it falls under the umbrella of her ongoing Tabulator work. Screenshots of the Tabulator UI are in figures \ref{query-ui01} and \ref{query-ui02}.

\begin{figure}
\centering
\includegraphics[scale=0.5]{query-ui01}
\caption{Tabulator browser presentation of the MIT Prox Card policy.}
\label{query-ui01}
\end{figure}

\begin{figure}
\centering
\includegraphics[scale=0.5]{query-ui02}
\caption{Tabulator justification user interface.}
\label{query-ui02}
\end{figure}

\subsection{Implementation Note}

With so many tools written in different languages and offering very specialized functionalities, it was necessary to write some simple tools to tie them all together. The scripts provide additional functionality, such as query conversion from SPARQL to N3, query history functionality, and template-based policy generation. Various scripts power the translator, the compliance checker, the policy generator, and user-facing Web pages. We discuss the functionality and design choices of those wrappers here; their full source is in the appendix.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% \section{Query Based Security}

% TODO (compare to other database security strategies - including the pessimistic assumption. maybe later.)

% \section{Designing and Writing Policies}

% TODO Discuss how we generate policies. Examine details of policies in appendices.


\section{Summary}

This chapter explained the functionality of the system. It can convert SPARQL queries to N3. It defines a set of primitive policy types. It allows a user to easily create queries, possibly using information about an existing database. It can enable logging of queries. It allows a user to check a query against a policy for compliance. Finally, it allows a user to understand the compliance explanation generated by a policy runner. In sum, this is a complete, end to end system for policy assurance, that is ready to be integrated into an RDBMS.

All of the code in this section and the appendix are available in the MIT CSAIL De\-cen\-tral\-ized Information Group's code repository. Please contact a member of DIG for access.