\chapter{Related and Prior Work}

This chapter presents some of the prior methodologies in database security, as well as other uses of Semantic Web technology in this field. There is some overlap between this and preceding chapters.

\section{Policy Awareness}

The Information Accountability paper \cite{info-account} describes the arenas in which policy-aware systems are if interest.  Among these are privacy, copyright, surveillance, and data mining. The paper argues that if we can trace who uses data and how they use it, if we have a way of finding accountability, we can begin to move toward policies that are between full disclosure and full control. The paper offers scenarios of how information aware systems would be useful in everyday life.

REIN \cite{rein} (Rei and N3, named after the Rei policy specification language and the Notation 3 language) describes a system offering ``policy management''. The system is policy-language agnostic, and operates in an independent environment, offering a yes/no opinion of policy compliance to an existing system. The REIN paper defines three possible operating modes. In the server-side rules operating mode, the analogy of choice is an application for a library card. The user does not know the rules, but supplies the information requested on the form to the library for processing. In the client-side rules mode, the onus of rule checking is on the user. The hybrid mode offers a way of sharing this responsibility. At present, REIN is implemented in Python, using the N3 policy language and the CWM reasoner.

There is prior work demonstrating the use of a policy-aware system for checking license information. \cite{creative-commons} \cite{Oshani:Thesis:2009} In this case, the system uses Creative Commons license policies to check for compliance with the content creator's sharing preferences. This system is a way of ``keeping honest people honest.'' The system, as implemented, converts meta data on an image file to an AIR policy, and checks usage patterns against a set of rules that parallel the Creative Commons license policies. If, for example, an image from the Flickr Web site is used on someones personal page, in violation of the content creator's wishes as expressed in their choice of Creative Commons licensing, this tool will detect a conflict.

\section{Methodologies of Access Control}
% talk about DAC/MAC/RBAC here

It is instructive to look at the approaches systems designers take in designing traditional access control systems. Systems designers tend to have a different mind set when approaching security problems than Web designers: systems designers tend to favor closed systems, whereas Web designers tend to favor open systems. Existing relational database systems extend on these technologies in their implementation of access control.

\subsection{Mandatory and Discretionary Access Control}

The traditional models of access control, circa the mid 1980s, consist of mandatory access control, or MAC, and discretionary access control, or DAC. The origins of MAC lie in secure military systems, as a way to enforce permissions on proprietary, secured, confidential data. \cite{Qiu85trustedcomputer} In an MAC system, the only way to gain access to data is to expressly have an outside authority grant access. There is absolutely no provision for an end user to alter permissions on data. The primitives for enforcing access control extend deep into the design of the operating system. MAC systems explicitly define rings of access. Some secure Unix operating systems, including SELinux, utilize MAC design principles.

%MAC systems lend themselves to formal verification techniques. - verify this

Discretionary access control is a design principle that allows a user to show that they have the credentials to access data. \cite{Qiu85trustedcomputer}Users have the capability of modifying their own permissions, or the permissions of others. A widespread example of a DAC system is that of Unix style permissions. On Unix systems, everything, from devices to network sockets to data on a disk, is represented as a file. Every file has three sets of permissions: those for the owning user, for the owning group, and for everyone else. Each of these sets may be granted (or denied) permissions to read, write, and/or execute a file. The operating system checks the user's credentials and the file's permissions before enabling an operation. Access Control Lists (ACLs) are another example of a DAC system. There is prior work investigating the use of DAC in object-oriented databases. \cite{dac-db} Such a security model works well for general purpose computing, but not in all scenarios.

The excessive rigors in the implementation of a MAC system, and the inadequacy of a DAC system, have led systems researchers to newer design philosophies that allow finer granularity in security settings.

\subsection{Role Based Access Control}

In the early 1990s, systems researchers began to find practical limitations in the expressive power of the discretionary access control model. Researchers presented role based access control as a more secure, easy to implement alternative to discretionary access control. \cite{rbac} Role based access control is a form of mandatory access control, in that users obtain their permissions from an outside authority. The primary difference is in the structure of permissions. In role based access control, a user can be assigned to one (or more, in some implementations) ``roles''. As an example, a user in a large company might wear several hats throughout their time in the company, from ``human resources associate'' to ``financial associate'' to ``systems analyst''. Each of these roles requires a different set of permissions. The formal description of role based access control, demonstrated in the paper, shows the power and flexibility of assigning roles rather than individual permissions. It is particularly well suited to assigning permissions where a high user turnover rate is present.

\subsection{Rule- and Policy-Based Access Control}
% Talk about REIN here. Talk about SWRL, too.

Rule based access control is an extension and a generalization of role based access control. Instead of simple roles, we can use rules and logic to derive whether or not a particular agent should have access to a particular resource. As an example, consider a student trying to access a secured Web page. The student may have credentials saying, ``I am a graduate student'', ``my adviser is Professor Smith'', and ``I am a member of Research Group X''. The page may have a set of rules defining access: it may be restricted to individuals who are members of certain research groups, and also graduate students. Different versions of the page may exist based on the credentials used. REIN \cite{rein} and SWRL \cite{rulebased} provide sample implementations of policy based access control. This thesis mimics the RBAC approach, using a similar methodology to define and process rules.

\section{Prior Work in Relational Databases}

This project seeks to address a particular shortcoming in database systems at present. In order to enforce access control policies, the database must be able to see the queries that a user is making against the database. This poses a problem when the queries themselves may reveal sensitive data. Access control lists are insufficient in this regard, whereas misuse and intrusion detection efforts are more closely related with our work.

\subsection{Access Control Lists}

Modern relational database management systems utilize some form of discretionary or role-based access control to regulate access to tables, using some form of access control list. The database administrator may restrict database access to particular rows or columns in particular tables, may grant or deny read and write permissions, and may limit the user to executing particular queries. \cite{mysql} In effect, all of these systems limit the user to seeing a particular subset of the data in the database.

It follows that the structure of the access control policy must closely follow the structure of the data itself. The structure of the security policy may divulge secure information, and may not be tolerant of changes to the underlying structure of the database.

\subsection{Access Control Features In A Modern RDBMS}

Modern RDBMS packages, such as Oracle, offer a wide variety of access control features. We can roughly group them into five categories. \cite{acl-oracle} \cite{oracle} \cite{rbac-commercial}

\begin{itemize}
	\item{Privileges}
	\item{Views}
	\item{Stored Procedures}
	\item{Roles}
	\item{Virtual Private Databases}
\end{itemize}

Privileges explicitly grant access to perform an action on a particular database object. A database administrator, or a user with appropriate privileges, can grant or revoke privileges. A view is a dynamic table, the output of a particular query against the database. A stored procedure is a particular command, like a user defined function, provided for database access. A role, as described in the preceeding section on role-based access control, is a type of meta-user that helps congregate database permissions. A virtual private database is a form of information hiding, partitioning the database such that a user only sees certain data.

These systems are able to specify access rights for users and groups on databases, and work well enough in the majority of usage cases. However, this design is not without its limitations. The access policies must closely follow the structure of the data, in many cases involving the database administrator to know a great deal about the structure of the tables. Changes to the layout of data will break previously defined policies. Privileges can be granted or revoked in a conflicting fashion, often making it difficult to determine correct access. Views and virtual private databases may be overly restricting, preventing honest individuals from carrying out their responsibilities. Our work seeks to address the numerous limitations with systems of this design by introducing a more flexible, more abstract means for defining permissions.

\subsection{Misuse and Intrusion Detection}

There is prior work in the field of intrusion detection systems in databases. Intrusion detection researchers argue that, though an outsider threat is very real, current databases do little to guard against violations of policy conducted by a trusted insider with database access. Detection of Misuse in Database Systems (DEMIDS) \cite{demids} was one of the first systems to detect potential misuse in a database system. By monitoring queries, using audit logs, and determining a frequent item set, the DEMIDS system is able to guess whether or not a particular access by a particular user is likely to be permitted. This approach suffers from a granularity issue: with too high or too low granularity, it is likely that the system will see a high number of false positive policy violations, or respectively, a high number of false negatives.

Later systems expand on some of the DEMIDS ideas, adding better learning or classification techniques. Cathey et. al. \cite{Cathey03misusedetection} developed a system that uses a combination of user query profile learning, abnormal query behavior detection, clustering documents, clustering query results, and relevance feedback to better understand how particular users and groups access a database. Over time, Cathey's system will ``learn'' common access patterns. Kamra et. al. \cite{Kamra} extend this theme further, requiring the use of role-based access control, and using a naive Bayes classifier to learn standard access patterns. Their approach argues that, by only considering roles instead of individual users or tables, the granularity is correct. Kamra's paper also argues against a formal definition of intrusion detection, saying that the intrusion detection application is much more purpose-built, and therefore less suited to a formal approach. The aforementioned techniques would be helpful in the initial instrumentation of rule sets, by finding common ways that users access data.

\section{Alteration of Data}
% discuss things like k-anonymity and de-identification here

An orthogonal approach to security of certain kinds of sensitive data is the alteration of the data itself. If we restrict the utility of the data through alteration, we can, in theory, limit the usefulness of the data, its sensitivity, and its potential for damage. The classic example of this approach is the use of a heavy black marker to manually censor sensitive government documents prior to public release. There has been some research into formalizing and automating this process.

% TODO: Talk more about Latanya Sweeney's work.

Sweeney has published a great deal of research on this particular topic. k-anonymity \cite{kanon} \cite{kanon2} offers a formal approach to the de-identification process. A data set that offers k-anonymity for some k is constructed such that, for every attribute (column), there are at least k occurrences of that attribute (rows). Phrased differently, a data set of census data  would be anonymous for $k=2$ if there were at least two occurrences of each name, address, age, and so on. By setting a value of k, dependent on the application, we can guarantee difficulty in re-creating and de-identifying the data set. Datafly \cite{datafly} is a system that can de-identify data automatically, removing singletons, changing social security numbers (SSNs) and dates of birth, changing ZIP codes, and the like.

% TODO: Talk about the MIMIC II database, and the challenges they face, based on the conversation with Mauricio.

De-identification and privacy problems are nothing new to the biomedical informatics world. Access to, and use of, medical data faces heavy regulation, most prominently from the Health Insurance Portability and Accountability Act (HIPAA) of 1996. The Multi-parameter Intelligent Monitoring for Intensive Care database, or MIMIC II (\url{http://mimic.mit.edu/}) is a database, provided by an MIT research group, for research purposes. The MIMIC group has published many papers discussing and evaluating different de-identification algorithms.

With access to the database, our system could potentially anonymize data using any of a number of well known techniques. This would allow us to keep original data, and only serve data that is as specific as it needs to be for a particular application.