Relational Databases

A relational database store data in a series of tables so that the data models a mathematical theory of relations. The model allows for queries based on projection, selection and join, among other operations, and connect the data in the tables by way of keys. The queries are expressed in a standard syntax called SQL, the Standard Query Language, which is common to all various vendors of relational databases.

The theory of relations states that data is arranged as various sets of tuples, called relations, where a tuple is collection of values for attributes. A relation states which attributes it collects. Concretely speaking, the attributes are the columns of a table, and the tuples are rows in the table. Constraints among the attributes will allow only certain tuples to be valid members of the relation, and the database should not allow rows into be inserted in the table if they would violate the constraints.

For instance, the mathematical theory says that if two tuples agree in the value of all attributes, they are the same tuple. In a table, it is possible for two distinct rows to contain the same data for all columns. However, the database should prevent this from happening because that would not be consistent with the mathematical model.

Schemas and keys

The description of the relations, including the attributes for each relation, and the governing constraints, are the database schema. From the schema we can determine for each table which attribute collections are keys. Loosely speaking, a key is a collection of attributes whose values will uniquely determine a tuple in the relation. For example, in a database of students, the student ID number will serve as a key. The first name, if that is an attribute in the relation, would not serve as a key, since many students could have the first name.

Typically there will be a single column in the table containing a unique integer index from each row of the table, and these will be the primary key for the table. But the definition for key is more abstract, and this definition must be used in order to understand normal forms.

Super key
A super key is any collection of attributes in a relation whose values will uniquely determine the tuple. To determine if a collection of attributes is a super key, the schema must be refered to, since the current data in the table might not demonstrate the most general behavior of the data. As a trivial example: the collection of all attributes of a relation is always a superkey.
Candidate key
A candidate key is a minimal super key. If a collection of attributes is a super key, but removing any attribute from the collection destroys that the collection is a super key, the collection is a candidate key. As a trivial example: if a superkey contains only one attribute, then it is also a candidate key. (Actually, not always. But the exception would hardly appear in the real world.)
Primary key
A candidate key declared to the database as a key, and the database will insure that the data model is not violated for this key.
Foriegn key
The name for a candidate key when it appears in a relation other than the relation for
Natural key
A candidate key which is also useful data. That is, it doesn't include an attribute invented for no other reason than to create a candidate key.
Surrogate key
The opposite of natural key. An attribute invented exclusively that the relation which contains it have a candidate key.
Prime attribute
An attribute which appears as an attribute for some candidate key. (It needn't be the primary key, or even any key of interest.)
Non-prime attribute
An attribute which is never included in any candidate key.

Normal forms

There are a sequence of normal forms determined by the schema of the databse. It is cosidered A Good Thing if a database ranks high in this sequence of normal forms.

First normal form requires that attributes are atomic. There should be no internal fields in the attribute, or if there are, the relations and schemas don't care about it. Typical examples of a violation of first normal form would be a relation which relates a parent attribute with a list-of-children attribute. If the intention is to query of the individual children of the list-of-children, this schema would violate first normal form, as the attribute list-of-children has internal structure of interest to the database. The solution would be to create a new table with two colums: parent and child, and each parent-child relationship would be an individual entry in the table.

Second normal form requires that the schema be in first normal form and that any non-prime attribute depends on the entirety of any candidate key, not on just some attributes of the candidate key. This must be ascertained (by looking at the constraints declared in the schema) for all candidate keys, not just the primary key, or for just candidate keys of interest. An example of a violation of second normal form would be a relation which relates parent with child with gender-of-parent. The pair of attributes (parent,child) is a candidate key, as it determines uniquely the row of the table (a parent can't have two genders). The entire three columns is a superkey, but any other combination is not a key, and therefore gender-of-parent is non-prime. However gender-of-parent does not depend on the entirety of the candidate key (parent, child), on the parent attribute within the key. Therefore the relation is not in second normal form. To normalize the table, remove gender-of-parent from the relation and create a new relation storing the parent, gender-of-parent pair.

Third normal form requires that the schema be in second normal form and that there are no functional dependencies among any of the non-prime attributes of the relation. An example of a violation of third normal form would be a relation which relates parent, gender-of-parent, and parent-type, where parent-type has value either father or mother. gender-or-parent and parent-type are both non-prime, and one attribute determines the other. To normalize the table, remove parent-type and create a new relation for the attribute pair gender-of-parent and parent-type. This relation will contain exactly two elements, (male, father) and (female, mother), and so in practice this table would not be necessary.

Finally, Boyce-Codd normal form is a third normal form which additionally requires that any functional dependency be a super key. Given that we have third normal form, the case not yet considered is a functional dependency depending on a mix of prime and non-prime attributes. Boyce-Codd requires that enough prime attributes are required so that the mix contain some candidate key. Also, Boyce-Codd does not limit consideration of the target to just non-prime attributes. (But I'm not sure if this is significant.)

Projection, selection and join

Queries on the relational database are done through the operations of projection, selection and join. In the implemented database, these are expressed in the SQL language using the SELECT statement. We will talk about these operations abstractly, in terms of the mathematical model, and also provide sample syntax for SQL.

Projection
Projection is the selection from each tuple only some of the attributes. A table with employee-id, last-name and first-name can, for instance, be projected onto the last-name attribuute. This would result in a set of last names. Project is used to indicate the columns of interest for the result.
Selection
Selection is the selection from the relation only some of the tuples. This selection is done by providing a selection criteria. For instance, selecting out of a table of all employees those employees who have been employed more than 4 years.
Join
A join combines two relations into one, by aligning equal values for some attribute or attributes common to the two tables. When a complex relation has been broken down into several simpler relations, a join will recombine the relation. For instance, one relation might express which department each employee is a member of; another relation might give the home address for each employee. To make a list of home address for each employee in a given department would require joining these two tables along a common attribute which idenfities each employee.

Though diverse, all these operations are accomlished in SQL by the SELECT statement. It is of the form:

   SELECT attributes-with-commas FROM tables-with-commas 
   -> WHERE conditions-on-attributes ;

Projection is accomplished by naming the attributes of interest in the attributes-with-commas part of the selection. Selection is accomplished by writing the selection criteria in the conditions-on-attributes following the keyword WHERE. The join is accomplished in the FROM part of the statement, either using the keywork JOIN or simply indicating all tables which are involved in the join.

A formal model

A formal model for relational databases (as well as a strong criticism of SQL!) is given in the paper The third manifesto by Darwen and Date. I add some notes here for the curious.

The theory is pretty close to the reality of tables. An attribute is a name. Associated with each attribute is a domain, which is the same as a data type. A header is a collection of attributes with their associated domains. A header ressembles a structure defnition in programming, giving field names and the type of each field. A tuple for a header is a set of values, one for each attribute in the header, and of the specified type for that attribute. This would be a row in a table or an instance of a structure in programming languages.

A relation has a head and a body. The head is a header and the body is a collection of tuples appropriate for the header. Tuples cannot be repeated, nor are they ordered. In that sense the body is a set of tuples. The relation then models a table at a certain point in time (except the Third Manifesto did not rule out infinite tables, but we won't take this too seriously). Finally a relation variable (relvar) is a variable whose domain is relations of a given header. A relvar corresponds to any possible relation.

The concept of key is then relative to the relvar domain. That is, a candidate key is a subset of the attributes of the header which unique identifies a tuple in a relation for any value of the relvar.

Locks and deadlocks.

Consider a database operation of deducting an object from stock. For the quantity must be tested, then decremented. Suppose in our web application, two users, connecting in separate with two browsers, need to deduct from stock the same item. It is possible that they both arrive at the database at the same time, and both deduct the item.

SQL databases provide several solutions. The simplest is provided by the database guarantee that update operations are atomic and serialized. If several update operations are commanded at the same time, the result in the database will be consistent with choosing and order of the operations, and running them one after the other, each operation completeing before the other beginning.

If table I has stores the Inventory, and the book ID B has stock quantity Q, the required update is:

update table I field Q=Q-1 where B=book_id and Q>0
and then checking the number of records updated, either zero or one. If zero records were updated, stock quantity wasn't sufficient. If one record was updated, then the stock quantity was sufficient and the book sold was taken out of inventory.

Other situations might require more complicated techniques. Tables can be locked, so that only one process can operate on the table at a time. Locks come in two forms, write are read. A write lock allows for any read, but if a table is write locked only the lock owner can write. A read lock prevents any access, read or write, by any but the lock owner.

Locks have a very definite problem: if two processes compete for a set of locks, it is possible that each of the processes claim some of the locks mutually block each other from proceeding. This is called deadlock. The lock operation of SQL can prevent deadlock by guarenteeing that locks are taken in a standard order, so that there cannot be a circular dependency of locks which cause deadlock.

As a final technique, SQL provides for transactions. These wrap multiple operations into a single, all or nothing package.

Example database design

We present a database schema for a book store.

Us: User table
Keeps username and user information, such as full and password. A surrogate key is primary.
Ad: Address table
Keeps addresses including home, shipping, previous, billing, and so forth. Problem is we might want to keep former addresses, multiple addresses for each customer, and special ship-to addresses, such as gift ordrs. A surrpogate key is primary.
Pr: Product table
A database of products (books). A surrogate key is primary, although the ISBN is also a candidate key and can be migrated. Prices are stated as list, with various modifiers for discount and specials.
Au: Author table
In support of Pr, since a product (book) can have multiple authors.
Fi: financial information
Payment methods for each user.
Pa: Payment table
How payment was made for a particular purchase. Includes separate rows for gift certificates, discounts, one-off adjustments, split payments, and authorization numbers for particular payments..
Pu: Purchase table
Represents a purchase, and its purchase identifier is a foreign key for a table of line items. Seems like I've limited myself in that a purchase will have a single payment, whereas perhaps it should have many (split payment? gift certificates?) Might need to look into this.
Li: Line-item table
A purchase includes mulitple products, each a line item. This table has a lineitem identifier local to the purchase so that line items can be listed with a consistent numbering each time the purchase is referenced.