DBMS Interview - December 1994
It's no wonder that proponents of object DBMSs (ODBMSs) and relational DBMSs (RDBMSs) don't see eye to eye. Some object folks simply want the DBMS to go away -- that is, to become invisible to the programmer. It seems that object-oriented programmers simply want to work with objects without having to consider whether they reside on disk (as persistent objects) or in memory (as transient objects). This is one of the many valuable insights in the new book, Object Databases: The Essentials (Addison-Wesley, 1994), by Dr. Mary E.S. Loomis, director of Hewlett-Packard Laboratories' Software Technology Lab.
In her book, Loomis recounts the evolution of ODBMSs, from the first persistent object managers for C++ and Smalltalk, to the full-featured DBMSs available today. Instead of simply evangelizing the ODBMS approach, she explains clearly and concisely the needs that ODBMSs set out to meet. She also discusses how opinions differ within the ODBMS community, depending on an individual's DBMS-centric or programming-centric perspective.
Dr. Loomis has more than 20 years' experience in software engineering and data management as a professor, consultant, programmer, and technical manager. She has worked with both engineering and MIS DBMSs, and has actively participated in developing both DBMSs and database applications. She has held positions with Versant Object Technology, General Electric, and D. Appleton Company, and was a tenured professor at the University of Arizona. She holds a Ph.D. in Computer Science from the University of California, Los Angeles. Dr. Loomis has written dozens of research papers and trade press articles, and five books.
At Hewlett-Packard, the lab that Loomis directs does database-related research oriented toward solving problems in the HP applications business. The lab is also working on application engineering -- the tools and techniques for building applications from components. The objective of this research, according to Loomis, is to develop a scientific basis for programming, moving the craft from an art into an engineering discipline. The lab also has ongoing work in more traditional distributed information management areas, including the construction and deployment of distributed applications.
DBMS Editor in Chief David Kalman recently interviewed Dr. Loomis at HP's Palo Alto complex. The conversation focused on the fundamentals of ODBMSs, and how the technology can simplify application development. The following is an edited transcript of the discussion.
DBMS: What is an object database?
LOOMIS: That's a very good question to start with because, of all of the object-related technologies, I think that object database is the least understood. When I say object database, I am referring to those data-management products that are specifically designed for use with an object programming language and are very closely coupled with one or more object programming languages. An object database, therefore, provides DBMS capabilities to objects that have been created using an object programming language, such as C++ or Smalltalk.
In Object Databases: The Essentials, you say that object programmers expect the DBMS to be invisible. What do you mean?
I mean that the typical C++ programmer who is not a database person, is not familiar with SQL, and is not familiar with relational systems, would like to deal with the object space as a very large extended virtual memory of objects. That person doesn't care whether the objects are in main memory, on disk, or across the network. The programmer would like to use the same syntax for accessing or traversing objects, regardless of what state the objects happen to be in -- transient or persistent. The programmer would like to have those objects behave the same way.
In a sense, programmers do not want to have to think, "I haven't yet accessed this object from disk, therefore, I have to use SQL." They would like to use C++ and have the system underneath know when an object is not in the cache and get it from disk in a way that hides the distinction between cache or main memory and what's on disk.
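As a rough illustration of the "invisible DBMS" idea, the sketch below uses Python rather than C++ and a dictionary in place of a real disk. The names ObjectStore and Ref are hypothetical and do not come from any ODBMS product. The point is only that the caller uses ordinary attribute syntax and never asks whether the object is already in memory.

```python
class ObjectStore:
    """Hypothetical stand-in for disk: maps object ids to stored state."""

    def __init__(self):
        self._disk = {}
        self.reads = 0  # counts how often we had to go to "disk"

    def save(self, oid, state):
        self._disk[oid] = dict(state)

    def load(self, oid):
        self.reads += 1
        return dict(self._disk[oid])


class Ref:
    """A reference that behaves like the object it points to.

    The first attribute access faults the object in from the store;
    later accesses are served from the in-memory copy.
    """

    def __init__(self, store, oid):
        self._store = store
        self._oid = oid
        self._cached = None

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so
        # _store, _oid and _cached never reach this method.
        if self._cached is None:
            self._cached = self._store.load(self._oid)
        try:
            return self._cached[name]
        except KeyError:
            raise AttributeError(name) from None


store = ObjectStore()
store.save("emp-1", {"name": "Ada", "dept": "R&D"})

emp = Ref(store, "emp-1")
name = emp.name  # first access: fetched from the store
dept = emp.dept  # second access: already cached, no extra read
```

A real product would also have to handle updates, cache eviction, and object identity; this sketch shows only the read path.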
Some database specialists would argue that a true DBMS should be separate from any programming language or access method. How do you view this difference of opinion?
There has been a very deep distinction between the classic database perspective and today's object database perspective. However, this object perspective is moving more toward the traditional DBMS concept of databases being repositories of objects that are available to multiple applications.
The difference in philosophy showed up clearly in the first ODBMS products (such as Object Design's ObjectStore), which were specifically intended to be persistent storage managers for object programming language objects. Others, such as Versant, were intended to be DBMSs that used an object model rather than a relational model. Part of the distinction between the two types of ODBMS came from the people in those companies: programming language people versus database people. Those who wanted persistent storage discovered that they needed to share objects, so they moved toward incorporating the DBMS functionality. And those vendors with ODBMSs that emphasized the database part found they had to sell into the C++ and Smalltalk communities, where the programmers didn't want SQL interfaces, didn't know what queries were, and wanted the performance and close coupling that comes with the programming language. Those camps are coming together.
For which kinds of applications are object databases best suited?
The applications for object databases are changing as ODBMSs mature. Initially, ODBMSs were used primarily for persistent storage for single-user applications executing on client workstations, where performance was absolutely the most important criterion. That boils down to computer-aided design (CAD) applications. In fact, it was ECAD first and then mechanical CAD where the [ODBMS] technology was first picked up. The relational products could not provide the performance the CAD people needed, and C++ in particular had already been adopted by that community. You can now expect that ODBMSs will be used increasingly in mainstream data-processing applications for manipulating complex data that doesn't fit very well into the typical relational structures.
There has been an evolution in ODBMS technology as vendors have added database capabilities such as transaction management, locking, and shareability, to support multiuser applications on servers as well as workstations. The ODBMSs also support query processing, and tools are beginning to appear.
We're seeing a trend toward a fairly extensive use of the ODBMSs in the financial community; for example, for financial portfolio analysis where there is a lot of computation with time-series data. With a traditional relational system, it's pretty difficult to support that kind of calculation and get the kind of performance that the applications need. There is also an increase in the use of ODBMSs in the telecom industry, because the data structures that the applications need are represented by tightly interconnected networks of entities and the relationships among them.
When you say "typical relational," do you mean products based on the relational model or do you mean the relational model itself, as someone like Chris Date would describe it?
I do make a distinction between the relational model -- the theory and body of knowledge -- and what has been implemented in the relational products. I also make a distinction between the relational model and SQL. The relational products are based in the more formal technology, but they have not completely implemented -- or in some cases accurately implemented -- the theory. I'm really talking about the products.
At some point in their maturation, will object databases embrace the relational model? Will object databases become more relational than the current relational products?
I'm not sure I would put it in those words. I agree with most of what Chris Date says about the correspondence between domains in the relational model and types in object databases. (See DBMS Interview, October 1994, page 62.) But I don't think that the notion of domain was developed [in the relational literature] the way that the current notion of type is. You can look back now and say, "Oh yeah, by domain we meant the notions of behavior and notions of subtyping," but that was not initially exposed as part of the relational theory. If you agree that domain equals type, and you agree that by "domain" we meant all the stuff we now mean by "type," then you can argue that that is what the ODBMSs are trying to implement.
In your book, you describe how ODBMSs "tear down the walls" and eliminate "vaulting" over barriers in the development cycle. How do they do this?
Tearing down the walls between disciplines or stages of development is predicated on having a semantically expressive object model for specifying requirements and stating design considerations, and for driving implementation so you don't have to keep transforming from one model to another model. It turns out that as you go further and further toward implementation, you have to consider factors that you don't have to consider for the requirements side of things. You need to keep augmenting the model with more physical kinds of factors, but you don't have to change the base model of what the objects are.
Even if you use an object model in your requirements stage, isn't there a point where you have to transform that model into something more dependent on the implementation?
Yes. And that vaulting shows up very clearly if, for instance, you do an ER model for design requirements and then you have to figure out how to express your entities and relationships in relational tables. It's a totally different model. Typically, after figuring out the relational tables -- the base tables, views, and so on -- people tend to forget all about their ER models. And most people certainly don't keep their models up to date to reflect changes in the tables. So, it becomes more and more difficult to predict the effects of changes in the physical implementation and determine what has to be changed if you want to add new semantics. Having a consistent model throughout is a more productive way to develop systems because you don't have to do all of these transformations at different stages. This should result in high-quality systems that are easier to maintain.
Why not just compute across or over the mismatch in models, as Dr. David Kroenke suggests with his Semantic Object Modeling? Isn't it feasible to let the system deal with the differences in models and compute across them?
It's certainly feasible. Even today, you can take an ER model, which is the poor man's semantic model, and automatically generate relational tables that correspond to it. Several tools do that. The difficulty of maintaining two different models, even if they're computable or generated, is that at execution time you have the overhead of transforming one model into the other. That's where some of the performance hit comes in using a relational database with an object programming language. If you specify a query that requires a join instead of a query that can be processed by a logical traversal among objects, then you've taken a hit in performance. That's one of the difficulties in having a mismatch in the models. You can also lose semantic information.
What semantic information can be lost?
The type/subtype relationships get lost very easily when you go to the relational model from the object model. There are lots of ways to translate a type hierarchy into tables. One way is to make each type a table with its own table definition, and make each subtype down the tree (at whatever level in the tree) correspond to a particular table. So, you've got this bunch of tables and you have to rely on how you've named them to figure out the type/subtype relationships.
Let's say you have a type hierarchy that has Employee as a supertype and Hourly Employee and Salaried Employee as subtypes, and that translates into three tables. If you want to add an employee, and you figure out it's an hourly employee, in the relational system you have to put a tuple in the Employee table and Hourly Employee table, but not in the Salaried Employee table. If the application makes a mistake and puts a row in all three tables, then suddenly you have someone who is both hourly and salaried, which conflicts with the model. So, you've lost what the type/subtype hierarchy looks like. That's just one example of where the semantics are not inherent in the table structure.
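The three-table mapping described above can be sketched with Python's built-in sqlite3 module. The table and column names are assumptions made for illustration. In this plain schema, nothing stops an application from inserting the same employee into both subtype tables, whereas an object created from a class hierarchy has exactly one type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE hourly_employee (
        id INTEGER PRIMARY KEY REFERENCES employee(id), hourly_rate REAL);
    CREATE TABLE salaried_employee (
        id INTEGER PRIMARY KEY REFERENCES employee(id), annual_salary REAL);
""")

# Intended usage: an hourly employee gets rows in two of the three tables.
conn.execute("INSERT INTO employee VALUES (1, 'Pat')")
conn.execute("INSERT INTO hourly_employee VALUES (1, 18.50)")

# Application mistake: the table structure accepts this row without complaint.
conn.execute("INSERT INTO salaried_employee VALUES (1, 52000.0)")

# Pat now appears to be both hourly and salaried.
both = conn.execute("""
    SELECT e.name FROM employee e
    JOIN hourly_employee h ON h.id = e.id
    JOIN salaried_employee s ON s.id = e.id
""").fetchall()


# The same hierarchy expressed as types: an instance belongs to one subtype.
class Employee:
    def __init__(self, name):
        self.name = name


class HourlyEmployee(Employee):
    def __init__(self, name, hourly_rate):
        super().__init__(name)
        self.hourly_rate = hourly_rate


class SalariedEmployee(Employee):
    def __init__(self, name, annual_salary):
        super().__init__(name)
        self.annual_salary = annual_salary


pat = HourlyEmployee("Pat", 18.50)
```

Extra constraints or a discriminator column could close the gap on the relational side, but that logic has to be added by the designer; it does not follow from the tables themselves.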
Are there other semantics that are either lost or just difficult to model in a relational context?
Let's look at time-series data, where you have a sequence. You have to look at this bit of data first, and that bit of data next, and so forth. Let's put each of those bits of data, such as the price for a stock for some period, as separate rows in a table. The ordering of rows in a table in the relational model is insignificant. There is no guarantee when you retrieve the table that you will get the rows in the same order that you stored them. You have to add some kind of a sequence number that will enable you to reorder or resort. It then becomes an application responsibility to make sure that the logic is there to retrieve them in that particular order. If the time-series data were modeled as a time-series type, which is inherently sequenced, then the application just has to ask for the next one. The application does not have to worry about what the sequence is, or about doing the series itself.
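A minimal sketch of the relational side of this example, again using sqlite3 with hypothetical table, column, and symbol names: the rows are stored out of order, and the application supplies both the sequence number and the ORDER BY clause that restores the sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_price (symbol TEXT, seq INTEGER, price REAL)")

# Rows arrive in no particular order; seq is the application's own
# sequence number, added so the series can be reconstructed later.
rows = [("XYZ", 3, 99.0), ("XYZ", 1, 97.5), ("XYZ", 2, 98.25)]
conn.executemany("INSERT INTO stock_price VALUES (?, ?, ?)", rows)

# Without ORDER BY the result order is not guaranteed, so the
# application must always ask for it explicitly.
ordered = conn.execute(
    "SELECT price FROM stock_price WHERE symbol = ? ORDER BY seq", ("XYZ",)
).fetchall()
prices = [price for (price,) in ordered]
```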
The relational model gives you the flexibility to specify a logical order at any time. What if I want to look at the ODBMS time-series data in a random distribution or in reverse order? Where do I get that flexibility?
In the time-series example, there is just an inherent sequencing based upon time for these entries. So, a time-series data type might have forward iterators or backward iterators -- different ways of accessing the elements in that sequence. The sequencing is based on a fundamental, logical notion inherent in the type. However, there is a distinction between the logical structure of the time series and the physical implementation of how the elements might be clustered. If you want to access the time-series data by the values of prices, which are not a part of the inherent structure of the time-series, then you would have to do it the same way that you would in the relational system: do a query, with an ORDER BY clause. The flexibility isn't lost; it's still there. It's just that more of the semantics have been built into the objects.
How would I model the time-series at design time? What do I tell the system?
You would probably define a type called "time series," which you could apply to many different kinds of data that could appear in that time series. Then, you would indicate that there would be some sequencing based upon a combination of the date and time. Then, you would indicate that there are certain operations that can be performed on the elements of the time series or on the time series as a whole. So, you might indicate operators called Next, First, and Last.
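One way to picture such a type is the Python sketch below. The class name TimeSeries and its methods are illustrative assumptions, not ODL and not any vendor's API. Entries are kept in date-and-time order by the type itself, so callers only ask for the first, last, or next element, or iterate forward or backward.

```python
import bisect
from datetime import datetime


class TimeSeries:
    """Hypothetical time-series type: entries are always kept in time order."""

    def __init__(self):
        self._stamps = []  # sorted timestamps
        self._values = []  # values, parallel to _stamps

    def add(self, stamp, value):
        # Insert at the position that keeps the timestamps sorted.
        i = bisect.bisect_right(self._stamps, stamp)
        self._stamps.insert(i, stamp)
        self._values.insert(i, value)

    def first(self):
        return self._stamps[0], self._values[0]

    def last(self):
        return self._stamps[-1], self._values[-1]

    def next(self, stamp):
        """Return the entry strictly after stamp, or None at the end."""
        i = bisect.bisect_right(self._stamps, stamp)
        if i == len(self._stamps):
            return None
        return self._stamps[i], self._values[i]

    def forward(self):
        return zip(self._stamps, self._values)

    def backward(self):
        return zip(reversed(self._stamps), reversed(self._values))


ts = TimeSeries()
ts.add(datetime(1994, 12, 2), 98.25)  # added out of order on purpose
ts.add(datetime(1994, 12, 1), 97.5)
ts.add(datetime(1994, 12, 3), 99.0)

oldest = ts.first()
after_oldest = ts.next(oldest[0])
newest_first = [value for _, value in ts.backward()]
```

Because the element type is not fixed, the same class could hold prices, volumes, or any other value, which matches the idea of one time-series type applied to many kinds of data.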
In ODMG-93, there is agreement about the object model and the syntax and semantics of ODL (object definition language) for defining types. ODL is independent of C++ or Smalltalk, and is an extension of OMG's (Object Management Group) Interface Definition Language (IDL). ODMG-93 is the vendors' attempt to standardize ODBMS interfaces in much the same way that SQL has been standardized for relational systems.
What are some of the approaches to using relational DBMSs with object programming languages?
A difficulty of using a relational database with an object programming language is the model mismatch we've discussed. You have to transform the program's notion of "object" to something that can be understood by a relational DBMS. The structures need to be transformed into table structures, and the access needs to be transformed into SQL because that's the only interface that the relational system will support. In some cases this is trivial. In other cases it can be very complicated. It's in those very complicated cases that an object database can make your application development go more smoothly.
There are many cases where you want to write applications in an object programming language, and access existing RDBMS data. In most of those cases, you don't want to replicate the data in an object database as well, because you get into the question of which is the "real" data. Using a relational database with an object programming language is the right answer, and there are lots of ways to do it. There are gateway products. You can write class libraries and encapsulate in the methods of those classes all the access to the relational database. It's work that has to be done.
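The class-library approach can be sketched as follows, using Python and sqlite3 in place of C++ and a production RDBMS. The Customer class, its table, and its columns are hypothetical. All SQL lives inside the class's methods, so the rest of the program works only with objects.

```python
import sqlite3


class Customer:
    """Hypothetical class whose methods hide all SQL from callers."""

    def __init__(self, conn, customer_id, name, city):
        self._conn = conn
        self.customer_id = customer_id
        self.name = name
        self.city = city

    @classmethod
    def find(cls, conn, customer_id):
        # Retrieval: one row becomes one object, or None if absent.
        row = conn.execute(
            "SELECT id, name, city FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return None if row is None else cls(conn, *row)

    def save(self):
        # Insert a new row, or overwrite the existing row with the same id.
        self._conn.execute(
            "INSERT OR REPLACE INTO customer (id, name, city) VALUES (?, ?, ?)",
            (self.customer_id, self.name, self.city),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

acme = Customer(conn, 1, "Acme", "Palo Alto")
acme.save()
acme.city = "Tucson"
acme.save()

found = Customer.find(conn, 1)
missing = Customer.find(conn, 2)
```

The relational tables remain the single copy of the data; the class is only a view of it, which avoids the question of which copy is the "real" one.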
HP recently introduced a product called Odapter, which specifically addresses this problem of getting easier access to relational databases from object programming environments. The whole notion of the Odapter is to provide objects on top of the relational system -- to hide the relational system. Odapter automates as much as possible of this process of figuring out how to make the relational system appear to be objects. It will scan SQL table definitions of existing databases and figure out the corresponding C++ to make object definitions. It also determines what SQL is needed for updating and retrieving rows in those tables. It does it in a way that will eventually support distributed databases as well. There are also other gateway products on the market. Many of them have visual tools for helping to guide the transformation.
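To give a feel for what scanning table definitions might involve, here is a toy sketch in Python. It is not Odapter's mechanism or output, which the interview does not describe in detail. The function class_for_table, the type mapping, and the example table are all assumptions, and the sketch relies on SQLite's PRAGMA table_info to read the column definitions.

```python
import sqlite3
from dataclasses import make_dataclass

# Assumed mapping from declared SQL column types to Python types.
SQL_TO_PY = {"INTEGER": int, "REAL": float, "TEXT": str}


def class_for_table(conn, table):
    """Build a class plus SELECT/INSERT statements from a table definition.

    The table name is interpolated into SQL, so it must come from a
    trusted source, never from user input.
    """
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, declared_type, notnull, default, pk).
    spec = [
        (name, SQL_TO_PY.get(decl.upper(), object))
        for _cid, name, decl, *_rest in cols
    ]
    names = [name for name, _ in spec]
    cls = make_dataclass(table.title().replace("_", ""), spec)
    select_sql = f"SELECT {', '.join(names)} FROM {table}"
    placeholders = ", ".join("?" for _ in names)
    insert_sql = f"INSERT INTO {table} ({', '.join(names)}) VALUES ({placeholders})"
    return cls, select_sql, insert_sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (id INTEGER PRIMARY KEY, name TEXT, weight REAL)")

Part, select_sql, insert_sql = class_for_table(conn, "part")
conn.execute(insert_sql, (1, "bolt", 0.02))
parts = [Part(*row) for row in conn.execute(select_sql)]
```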
For what environments is Odapter available?
Today, Odapter is available as an object management layer on top of Oracle. It is accessible from a variety of languages, such as C++, Smalltalk, C, Pascal, and Ada, and there is work going on to make it accessible from the visual programming environments as well. It also interfaces with Information Builders' EDA/SQL so it can access whatever EDA/SQL can access. OpenODB is a variant of Odapter that sits on top of HP-Allbase/SQL, HP's relational DBMS.
We don't hear much about Allbase in the marketplace. By keeping a low profile, is it HP's intention to clear the way for third-party RDBMSs and add value to them?
HP's strategy is to have very strong third-party relationships. That's what customers want. There is a very strong third-party program, and a whole variety of relational vendors are represented there.
How do the ODBMSs today provide the features that people expect from traditional DBMSs, such as concurrency control, query languages, and tools?
Most of the ODBMS products today provide transaction support, query support, and, by themselves or with third-party support, some tools. Let's take transactions to start with.
Concurrency control is provided in an object environment using basically the same techniques as an RDBMS. There's two-phase commit, the same locking and logging, and before imaging and after imaging. That technology has been well thought-out and specified, and implemented in the ODBMSs. Also, some of the ODBMS products have gone beyond traditional concurrency control to provide variants on the transaction model, providing long transactions or optimistic concurrency control and a choice of the kind of concurrency control.
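Optimistic concurrency control, one of the variants mentioned above, can be illustrated with a version check at commit time. The sketch below is a simplified, single-process illustration in Python; VersionedStore and ConflictError are hypothetical names, and real products deal with many objects per transaction, logging, and recovery.

```python
class ConflictError(Exception):
    """Raised when another transaction committed first."""


class VersionedStore:
    """Hypothetical store that keeps a version number with each value."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # A key that was never written reads as version 0, value None.
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        # No locks are held while the transaction works. The check
        # happens only here: the write succeeds if nobody else has
        # committed since this transaction read the value.
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConflictError(f"{key}: expected v{expected_version}, "
                                f"found v{current_version}")
        self._data[key] = (current_version + 1, new_value)
        return current_version + 1


store = VersionedStore()
store.commit("account", 0, 100)

# Two transactions read the same version of the object.
v1, balance1 = store.read("account")
v2, balance2 = store.read("account")

store.commit("account", v1, balance1 + 50)  # first committer wins

try:
    store.commit("account", v2, balance2 + 20)  # stale version: rejected
    second_committed = True
except ConflictError:
    second_committed = False
```

The losing transaction would normally re-read the object and retry, which is cheaper than locking when conflicts are rare.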
On the query processing side, query performance depends largely on things such as indexing, clustering, moving bunches of objects at the same time, and so forth. All those same techniques that were developed for the relational databases apply and are implemented -- more or less -- by the ODBMS vendors.
Is the transaction model specified in ODMG?
Not completely. In ODMG-93 there's a basic transaction model. Maybe more details will appear in ODMG-95.
How do you specify a query in an object language?
ODMG-93 includes an object query language that's based on a product from O2 Technologies. It's a superset of SQL, and there's an effort underway to bring it together with SQL3. So, one way to specify queries is to use a SQL-like language. Another way to do a query is through the object programming language itself. Smalltalk has a method called Select, which is used to specify a parameter to be applied against a collection, and to select the elements in that collection that meet that criterion. The syntax is totally different than SQL, but you can specify basically the same kinds of queries. Some vendors now support embedding SQL strings in C++. It's not always very clean, but it can be powerful.
Another approach used by some vendors is to extend C++ and use a preprocessor. In C++, there's a square bracket syntax that you use to specify the index into an array. Within those square brackets, you can put a selection expression as you would in a WHERE clause, and apply that against a collection. A collection is a group of objects that have the same type, as opposed to a structure, which is a group of objects with different types.
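Both styles of in-language query can be approximated in Python. The sketch below is an analogy only: the Collection class, its select method, and the Employee type are assumptions, not Smalltalk's or any vendor's actual syntax. A selection expression is applied against a collection either through a select method or inside square brackets, and plays the role of a WHERE clause.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    name: str
    dept: str
    salary: float


class Collection(list):
    """Hypothetical collection of same-typed objects with query support."""

    def select(self, predicate):
        # Keep the elements for which the predicate is true.
        return Collection(item for item in self if predicate(item))

    def __getitem__(self, key):
        # Square brackets accept either an ordinary index or slice,
        # or a selection expression in the form of a callable.
        if callable(key):
            return self.select(key)
        return super().__getitem__(key)


staff = Collection([
    Employee("Ada", "R&D", 72000.0),
    Employee("Ben", "Sales", 48000.0),
    Employee("Cy", "R&D", 51000.0),
])

# Roughly: SELECT name FROM staff WHERE dept = 'R&D' AND salary > 60000
well_paid = staff.select(lambda e: e.dept == "R&D" and e.salary > 60000)

# The same kind of selection using the bracket form.
sales = staff[lambda e: e.dept == "Sales"]
```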
Some ODBMSs execute behavioral code and some leave it to the application. How do these DBMSs differ in their architectures and in the kinds of applications they support?
It's a fundamental tenet of object models that an object is some encapsulation of both state and behavior -- the data structures and the methods. Yet many of the ODBMSs today only look at the data structure side of things. They don't actually store or execute methods. They leave that completely to the application programming environment, because they are so closely coupled with the object programming languages. However, the trend is toward being able to store and execute the methods -- the procedural part -- in the database as well, especially in the Smalltalk environment. An object should be in an application if it's application-specific and used in only one application program. If it's shared in the same sense that the data part of the object is shared, then that object should be a database object instead of just a local persistent object.
The kinds of things we do with stored procedures today look like good candidates to continue to execute in the database. There are several products now that are truly object engines; that is, they execute methods. The Smalltalk-based engines do it, such as Gemstone from Servio. It has always been able to execute methods in the database. The reason is that its engine includes Smalltalk. I believe Versant is now executing in the engine, and Itasca is too.
There have been some research prototypes, such as HP's IRIS, that have always executed in the server. IRIS, which became OpenODB, is more of a SQL-based extension to the relational environment and has more of the database perspective than the object programming perspective.
Do you see these object engines spanning clients and servers, and executing methods in various locations transparently?
Yes. The overall perspective of an object environment is one where components are located wherever it is most appropriate within a network, and they ask each other to do things. They request services of each other, so the thread of control goes from one object to another in the process of responding to some request. All that activity underneath, regarding which objects are invoked, where they reside, and how they have to communicate, should be completely transparent. Some might be running on local client environments, some might be on local server environments, and some might be off in some remote enterprise environment.
Will object languages and DBMSs become the dominant tools for building systems and accessing information?
Application engineering through component programming -- where you select preprogrammed components from libraries and put them together and then customize them for particular needs -- will be the dominant paradigm. Those components may be engineered using a whole variety of languages. Perhaps it is easier to engineer components with object programming languages than with non-object-oriented languages.
People will use visual programming techniques to assemble components, and combine them with scripting languages. The low-level components may be implemented using object programming languages. I don't think that the huge mass of Cobol programmers will become C++ or Smalltalk programmers, although more of them will become Smalltalk programmers than C++ programmers. I really think that these visual programming languages will make a huge difference.
Many of the commercial development tools seem to implement what we call "object-based" or "pseudo OOP" concepts...
At the upper levels, that's perfectly appropriate, but lower-level components have to be engineered for very high performance, guaranteed interoperability, and so forth. Programming languages such as C++ may very well be used for those kinds of components.
What is the role of software technology in HP's strategy?
Software is a very important enabler for a whole variety of HP-provided solutions, whether it's in the healthcare, analytic, automotive, financial, or telecom arena. Software and services are essential components of what we call "HP-enabled solutions," where customers are actually writing the applications and putting the pieces together. HP views software as an enabling solution, but not as an end unto itself. The emphasis is on leveraging depth in particular technology areas to solve real problems in our application divisions.
Will you continue writing about object databases?
I've been doing a series of columns for the Journal of Object-Oriented Programming on object database topics. Those columns provided some of the working papers for my book. I expect I'll continue to do some of those columns, although I can't continue with the same frequency. Every time I write a book, and this is my fifth one now, I boldly state that I will never do a book again, and I always end up doing another one. I expect I will, but I don't know what it's about yet.