\chapter{Introduction}

Databases are a common way of storing data, and SQL is arguably the most popular variety of database in production as of this writing. The most popular SQL database implementations, Oracle and MySQL, offer a rudimentary set of security options. SQL has a table-centric view of data and a UNIX-like security model \cite{mysql}. A database super-user can create individual tables, create user accounts, and grant or deny specific users access to specific tables.

The existing state of the art in database security is administrator-centric and based on security models that are decades old. While adequate for many applications, these systems have limitations. They provide an ``all or nothing'' model of security, with no ability to show which rules produced a given result and no specialized, data-centric policy system. More information about the conclusions that a policy system reached is helpful: it promotes transparency, helps find errors, helps keep users honest, and encourages more fine-grained access policies. I expand on these systems with the use of Semantic Web technologies. In particular, I offer policy assurance, which provides a guarantee that a database query was or was not in compliance with a policy, together with an exact description of why the query is or is not compliant.

To securely grant access to a piece of information, data security engineers typically ask the following questions, or make assumptions as necessary:
\begin{itemize}
    \item Who are you?
    \item What information are you trying to see?
    \item Why do you want to see this information?
\end{itemize}

Electronic systems traditionally rely on a few common design patterns in their implementations of security primitives. Discretionary access control \cite{dac-db} is the most familiar approach, and the basis of the UNIX security model. A user or super-user can expressly allow or deny other users certain permissions to certain pieces of data. Mandatory access control \cite{Qiu85trustedcomputer} requires explicit permission for any action, granted by an outside body; for a time, this was a popular design pattern for government systems containing classified data. Role-based access control \cite{rbac} groups users into specific roles, which themselves have specific permissions. Rule-based access control \cite{rulebased} allows an administrator to create more specific policies for governing access.

This thesis documents a database security system built using Semantic Web technologies and design principles. I argue that a query to a database contains a substantial amount of information about what a user is trying to access and why they wish to do so. I design and implement a system that provides rudimentary database security and explanations for the policies it enforces. The model user is ``honest but curious'': a trusted, well-intentioned employee of a government agency who is using a database provided to the user's agency under numerous policies and agreements.

This thesis makes several contributions to the field. The SPARQL-to-N3 conversion tool and its related ontology are novel. The definitions of template policies, and the tools for creating them, reduce policy-based security to a few primitives. Policy assurance provides more transparency to users and administrators alike.

I approached this project in two phases. In phase one, I used SWObjects to convert SPARQL queries to an RDF serialization. I wrote low-level AIR policies to check the compliance of these SPARQL queries, and demonstrated some test cases. In phase two, I created an abstraction for these SPARQL queries, and provided a meta-level that removed dependence on the data structure.

\section{Motivating Example}

\begin{quote}
\emph{``Policy Assurance'' is a set of technical mechanisms that enable effective and accountable information sharing and usage.}\cite{info-account}
\end{quote}

``Policy assurance'' is a process by which we can be anywhere from reasonably to absolutely certain that actions in a system are in compliance with an existing policy. A claim of compliance (or noncompliance) must be backed by some data set that validates it.

As a sample scenario, we consider a database that contains highly sensitive data and, as a result, is accessible only to a limited number of people. For this hypothetical database, we assume that the database administrator is not the database owner and thus cannot see either the contents of the database or the queries that external parties make to it. The database owner may see the information in the database, but not the queries made to it. Policy assurance allows the database owner and database administrator to verify that every query made to the database was in compliance with the given policies, without exposing any sensitive information.

The key point of policy assurance is that, if an external party knows the policies and has access to enough information about each policy check, the external party can verify compliance with the policies. If there is a guarantee that every single access is logged, then an external party can validate a claim of complete compliance. This promotes transparency, by allowing verification, while providing security, which we might define as ``minimizing the information required to permit verification.''

\subsection{Sample Usage Scenario}

In this thesis, we will consider a simple usage scenario in which an administrator wishes to guard data in an RDF data set that users access using SPARQL queries. In this scenario, users may not access location information about entities in the database; specifically, they may not retrieve or use address information in any form. The administrator may not see the queries that users are making, for security reasons, but must be certain that \emph{every single query} made to the data set is compliant with the policy about addresses.

The system described in this thesis is well suited to this task. In later sections, we will describe how the administrator would create a policy to enforce this, and how the administrator would check some sample queries manually. We will then describe how this system could be configured to perform this task automatically, providing policy assurance for the administrator and security for the users.
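
To make the address policy concrete, the following Python sketch shows one very simplified way to flag a query that mentions address information. It is only an illustration and is not the mechanism developed in this thesis, which translates queries to N3 and reasons over them with AIR policies. The sketch instead performs a purely textual scan of the query. The namespace \texttt{http://example.org/schema\#}, the forbidden property names, and the helpers \texttt{expand\_terms} and \texttt{check\_query} are all hypothetical names introduced for this example.

\begin{verbatim}
import re

# Hypothetical vocabulary: these IRIs are illustrative assumptions,
# not terms taken from the thesis or from any real data set.
FORBIDDEN_TERMS = {
    "http://example.org/schema#address",
    "http://example.org/schema#streetAddress",
}

# PREFIX declarations, full IRIs, and prefixed names (e.g. ex:address).
PREFIX_RE = re.compile(r"PREFIX\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>",
                       re.IGNORECASE)
IRI_RE = re.compile(r"<([^<>\s]+)>")
PNAME_RE = re.compile(r"([A-Za-z][\w-]*)?:([A-Za-z_][\w-]*)")


def expand_terms(query):
    """Return the set of full IRIs that the query text mentions."""
    prefixes = {m.group(1) or "": m.group(2)
                for m in PREFIX_RE.finditer(query)}
    body = PREFIX_RE.sub(" ", query)       # drop the declarations
    terms = set(IRI_RE.findall(body))      # IRIs written out in full
    remainder = IRI_RE.sub(" ", body)      # avoid rescanning those IRIs
    for prefix, local in PNAME_RE.findall(remainder):
        if prefix in prefixes:             # ignore undeclared prefixes
            terms.add(prefixes[prefix] + local)
    return terms


def check_query(query):
    """Report compliance and list the forbidden terms that were used."""
    violations = sorted(expand_terms(query) & FORBIDDEN_TERMS)
    return {"compliant": not violations, "violations": violations}


if __name__ == "__main__":
    example = """PREFIX ex: <http://example.org/schema#>
    SELECT ?name ?addr WHERE {
      ?p ex:name ?name .
      ?p ex:address ?addr .
    }"""
    print(check_query(example))
\end{verbatim}

The returned record names the offending terms, which loosely mirrors the idea that a compliance decision should come with the reason for it. A textual scan like this one ignores string literals, comments, and inference over the data, which is part of the motivation for reasoning over a structured representation of the query instead.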

\section{System Components}

In this thesis, I present a system that provides policy assurance for SPARQL databases by checking whether incoming queries conform to authorization policies written in AIR. The system uses query-based security to draw its conclusions. This system has several components:

\begin{itemize}
	\item A translator, which generates an N3 output from a SPARQL query. This output serves as an input to a reasoner.
	\item An ontology that defines the N3 output of the translator.
	\item A number of pre-defined policy templates in AIR, which capture the behavior of some of the most common policy design patterns. Some policies support reasoning over a log of previous queries.
	\item An algorithm for automatically generating AIR policies by combining user inputs with policy templates.
	\item A Web interface for performing SPARQL-to-N3 translation.
	\item A Web interface for generating policies.
	\item A Web interface for checking queries against policies.
\end{itemize}

All of the Web interfaces in this thesis work with Tabulator's Justification UI, for user-friendly viewing of queries and policies.


\begin{figure}
\centering
\includegraphics{system-drawing-1}
\caption{Architecture of our policy assurance reasoner, demonstrating separation from the RDBMS.}
\label{system-drawing}
\end{figure}

\section{Outline}

The remainder of this thesis is structured as follows. The next chapter provides a demonstration of the system, elaborates on the topic of policy assurance, and gives an overview of how this system is useful in implementing policy assurance. Following the overview, I discuss the design and implementation of each component of the system in detail, including the design assumptions and trade-offs necessary for implementation. I then explore the performance of the reasoner, look at future directions for this work, and draw comparisons with other work in the field before concluding.

The appendix chapters provide background information. I discuss the ongoing Semantic Web initiative and the technologies that provide the infrastructure for this work. I also offer code samples and useful information about the project code.