Free Java Development Seminars. Click here.
DBMS, March 1998
DBMS Online: SQL for Smarties By Joe Celko

Synchronizing Data Access

Critiquing the Three Major Types of Concurrency Control.


"Pardon him, Theodotus; he is a barbarian, and thinks that the customs of his tribe and island are the laws of nature."
      ı George Bernard Shaw,
Caesar & Cleopatra

I was traveling a lot at the end of last year, playing consultant and SQL trainer all over Europe and the United States. Itıs good to get out in the world and see how other people do things. Sometimes you can pick up a new trick, but more often than not, you simply find that they are as screwed up as you are. There are good reasons that Dilbert is translated into many languages.

People in the United States have argued for a universal identifier for a number of years. In a CompuServe posting, Malc Smith reported his experiences in Norway, where bank cards double as national identification cards and have each individualıs "official number" on them. The official number is based on your birthday, three digits, and a check digit. Malcıs birthday was encoded as the 32nd day of February, 1999.

He suspects that it was because he was a foreigner; the system used an error code for the birthdate, but it was not properly trapped by the system. He had plenty of fun using that card to hire cars because people in Oslo did not know what to do with the number, and yet it was utterly official.

I spent a week with a national public broadcasting system in Europe. We have the UPC (Universal Product Code) bar codes on almost everything sold in the United States. The Europeans use an EAN code (European Article Number) for the same purpose. It is a unique code system that was established in the mid-1970s following the adoption of the UPC in the United States. The EAN is made up of a country code (two digits), a manufacturerıs code (five digits), an item reference (five digits), and one check digit. This is EAN-13. There are other variants, for example, to identify a unique pallet in transit. For more information check out the Article Numbering Associationıs Web site at www.ana.org.uk.

Whatıs missing is a code for identifying broadcast programs. Everyone in the industry seems to want one, but nobody has any idea of how to set it up. There is the ISBN for books, the VIN for vehicles, and catalog numbers for music played on the radio.

Everything in the television industry is based on the current title of the show, usually kept as a 40- to 50-character string. This means that "Benny Hill," "The Benny Hill Show," "Benny Hill Show," and all possible versions of the name of a show have to be resolved by hand. Likewise, if a program changes names, you have to know when the change was made.

This seems weird to me because you have to track shows to pay royalties. Money usually improves data processing. If any reader knows of some scheme for a "Universal Broadcast Program Number," or whatever itıs called, please let me know.

OpenGIS

Before I sound like I am depressed about the lack of international standards, I should mention that the geospatial data database people have been busy setting up standards. In September of last year in Cambridge, England, the OpenGIS Consortium (OGC) (www.opengis.org) met and heard four proposals and ratified three, with the result that OpenGIS is now a reality.

OGC is an international consortium of more than 100 corporations, government agencies, and universities. It coordinates collaborative development of the OpenGIS Specification and collaborative business development efforts to support the full integration of geospatial data and geoprocessing resources into mainstream computing.

The three ratified submissions include OpenGIS Simple Features Application Programming Interfaces for OLE/COM, CORBA, and SQL. GIS demonstrated its interoperability at the OpenGIS ı97 in Strasbourg, France, on November 11 and 12, 1997.

Concurrency Control

Concurrency control is the way that a database synchronizes usersı access to data to ensure that they do not destroy each otherıs work. It can be divided into three major types: locking (or pessimistic control), generational (or optimistic control), and logical locking.

Locking is the way that most databases handle concurrency control. While one user has a lock on a subset of the database, that data is protected from other usersı attempts to make changes to it. One userıs changes have to be commit-ed before another user can see or further change the same data.

A locking scheme has two major characteristics:

A dynamic locking scheme will automatically increase the granularity of locks from row to page to table level when a certain threshold is met in the more fine-grained locks. The goal is to use the least amount of system overhead per lock while getting as many users safely into the database as possible. The usual approach is to start with a higher level lock, then reduce its granularity only when there is contention from another transaction that wants to lock the same page or table for an update, delete, insert, or select. This minimizes the total number of locks.

The other approach ı initially using a lower granularity and escalating it when a threshold level is reached ı has some problems. For example, escalation from row-level to page-level locks can cause deadlocks when two transactions attempt to acquire page-level locks on the same page at the same time.

Locking is also called pessimistic concurrency control because it assumes that conflicts among the users are to be expected. Optimistic concurrency control assumes that conflicts among the users are rare enough that they should be handled as exceptions instead of expectations.

Most optimistic schemes use a "time stamp" of some sort, which gives the user a snapshot of the database as it existed in a consistent state at a particular moment in time. Hence the term generational control.

Granularity still applies in this model, but it is now the level at which a time stamp exists; it is usually at row level, but it could be set at the table or column level. Time stamping for physical pages does not make much sense because each row within the page can be changed at a different time. The important point is to have a time zero (t0) point at which the database was internally consistent and from which all changes are then based. When user number one comes into the database, he sees the t0 version of the database and works with a local copy of it for his changes. When user number two comes into the database, two things can happen. Either user number one has done a commit and user number two sees the new version of the database, which we will call version t1, or he also sees the t0 version of the database because user number one has not done a commit yet. User number two now has his local copy of his changes, which we will call version t2.

The first case is no problem. The assumption in an optimistic system is that this is what happens most of the time. In the second case, when user number one or two tries to commit his work, he will get a message that the other guy has a copy of the same data and he is blocked until:

Notice that granularity plays a big part in this approach. If the time stamps are at the table level, then only one transaction at a time can change it. In effect, you would have a job queue like we did in the old tape file systems. The row level is workable if it is unlikely that two users will need the same row at the same time. But you could go down to the column level and allow different transactions to change different parts of the same row.

Logical locking is still experimental, but it is based on looking at the predicates in the SQL statements and computing overlapping subsets.

Instead of putting physical locks on rows or pages of storage, all transactions go to a scheduler that decides which transactions can be active in the database at the same time. For example, given a table, if one transaction wants to update a status code to x in all rows where "age > 21" and another transaction wants to update the same status code to y in all rows where "hair = blonde," the scheduler has to determine if there is any overlap in the result sets and exclude one transaction from the database until the other has finished. The goal is to determine which transactions can be run at the same time. This approach frees the logical model from the physical model, and it works well with inverted file structures or bit-mapped indexing.

Before you think this is the ultimate concurrency control method, you need to do a little more inspection. Imagine that we have transactions A to J coming into the database in alphabetic order with the following execution times in some unit:

A = 8
B = 2
C = 3
D = 3
E = 7
F = 7
G = 18
H = 2
I = 8
J = 8

Furthermore, we have examined the predicates in the SQL statements and their time stamps so we have determined that they have to be run in the following order:

A -> E 
A -> F 
B -> D 
C -> H 
D -> F 
E -> G 
F -> G 
H -> I 
H -> J

The transaction on the left of an arrow must complete before the transaction on the right can start.

If they are queued in one at a time, the total mix will require 66 units of time. However, this mix of transactions can be split into two independent queues that can be submitted in parallel; S1 = (A, E, G) and S2 = (B, C, D, F, H, I, J). This mix requires 33 units of time and makes good use of parallelism in the system.

Notice that if we submit the jobs in the order (A, B, C, H, D, E, F, G, I, J) we will get S1 = (A, E, I, J) and S2 = (B, C, H, D, F, G), which uses 35 units of time! Finding the right submission ordering is not easy.

But life gets worse. Reduce the execution time of each transaction by one unit of time and submit them to two queues in alphabetic order. The transactions sets will be S1 = (A, E, F, G) and S2 = (B, C, D, H, I, J) for a total of 36 units of time. We lost ground simply from the luck of the submissions.

What if the scheduler could handle three transactions sets? You would think that extra parallelism ought to help, but it doesnıt always. This mix would give us S1 = (A, E, G), S2 = (B, D, I, F), and S3 = (C, H, J) with a total execution time of 38 units!

The moral to the story is that complex systems can bite back.

Puzzle

This is a simple accounting puzzle. You are given a table that represents an accounting journal with transaction dates, amounts, and the accounts to which they are applied. Find the number of days between each transaction and post that number on the first of the transactions, effectively giving you how many days until the next transaction against that account. Assume that the table is very simple:

CREATE TABLE Journal
(acct INTEGER NOT NULL, 
trxdate DATE NOT NULL, 
amount DECIMAL (10, 2) NOT NULL, 
duration INTEGER NOT NULL);
INSERT INTO Journal
VALUES ((1, ı1997-03-01ı, 198.95, 0),
(1, ı1997-03-02ı, 98.00, 0),
(1, ı1997-03-13, 498.00, 0),
(2, ı1997-03-02ı, 18.00, 0),
(2, ı1997-03-03ı, 90.75, 0),
(2, ı1997-03-04ı, 498.00, 0),
(2, ı1997-03-15ı, 98.25, 0)); 


Puzzle Answer
UPDATE Journal
SET duration 
= (SELECT 
CAST ((Journal.trxdate - T2.trxdate) DAYS
AS INTEGER) 
FROM Journal AS T2
WHERE T2.acct = Journal.acct
AND T2.trxdate = (SELECT MIN(trxdate)
FROM Journal AS T3
WHERE T3.acct = 
Journal.acct
AND T3.trxdate > 
Journal.trxdate))
WHERE EXISTS (SELECT *
FROM Journal T4
WHERE T4.acct = Journal.acct
AND T4.trxdate > Journal.trxdate);
Since we did not say what you wanted to happen to the latest transaction for each account, this update will not touch those rows.

Joe Celko is an Atlanta-based independent consultant and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann, 1995) and Instant SQL Programming (Wrox Press, 1995). You can contact Joe via email at 71062.1056@compuserve.com.
What did you think of this article? Send a letter to the editor.


Subscribe to DBMS -- It's free for qualified readers in the United States
March 1998 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated January 29, 1998