
Exploding interest in the Internet and World Wide Web as application platforms has had a significant effect on the relational database management system (RDBMS) marketplace. Initially, Web hype reduced the profile of the major RDBMS combatants and overshadowed their emphasis on areas such as parallel database operations, data warehousing, and data replication. But users developing Internet/Web applications quickly recognized the need for a scalable, robust environment in which to store, manipulate, and manage the dynamic delivery of multimedia data and other complex datatypes. Connecting the DBMS and Web applications via a Web server also requires effective support for a three-tiered application architecture. In fact, a Web application can represent a microcosm of the full range of business requirements -- scalability, performance, access to integrated data across an organization, transaction management, deployment on multiple platforms, Java support, and so on.
When the Web came along, most RDBMS vendors were already working on extensibility to handle more complex data and applications -- Web applications simply provided an immediate need and a focal point for these efforts. Thus we are seeing a resurgence of interest in RDBMS technology as an underlying platform for complex-data applications, particularly those implemented on the Web.
This article addresses the application requirements that drive the efforts to extend the RDBMS to handle complex data. It also describes the components of an extensible data management architecture and the core features required to achieve RDBMS server extensibility. Part II of this article, which will appear next month, will discuss how the five major RDBMS vendors -- Informix Software Inc. (Menlo Park, Calif.), IBM Corp. (Armonk, N.Y.), Microsoft Corp. (Redmond, Wash.), Oracle Corp. (Redwood Shores, Calif.), and Sybase Inc. (Emeryville, Calif.) -- plan to support data extensibility.
In addition, corporate information systems are becoming more complex as organizations strive to design and integrate operational, data warehousing, and Web information systems. (See Figure 1.) A Web information system, for example, lets users access corporate information using a Web browser and Web-based applications; it essentially Web-enables anything.
To meet these application requirements, many organizations are looking for a single database platform that applies scalability, transaction integrity, proactive enforcement of business rules, and other robust DBMS functionality to complex data as well as traditional data. Because the RDBMS is already widely used for traditional applications, it makes sense to explore the possibility of extending it to manage a wider range of datatypes, application-specific semantics, and complex data relationships intelligently. This evolution of the RDBMS is referred to as "extended relational" (because the underlying data model is extensible) or "object-relational" (because the RDBMS can now understand "rich datatypes" or "objects" that represent complex internal structures, attributes, and behavior and require new search methods). A common term for products in this space is "universal server" or "universal database."
Another factor that is driving efforts to extend the RDBMS is the ever-present desire to increase developer productivity. One aspect of this is providing a single API to all data so that developers don't have to contend with multiple APIs depending on the type and source of data. An example is retrieving structured data from an RDBMS using SQL and retrieving related information on documents using the native API of a text search engine. If the data is related, why not just use one API, such as SQL, and have a single database server manage both types of data? The controversy that is evolving in this area is whether SQL is the appropriate language for this process. The object-relational approach pushes SQL3 as the single API. Microsoft, on the other hand, would like to see its OLE DB common object method interface take on this single-API role.
A second aspect of developer productivity is pushing support for object modeling techniques into the database server itself. These techniques include the ability to encapsulate data and its associated methods as objects and to reuse code through features such as inheritance and polymorphism.
There are two important points to be made here. One is that many people in the industry confuse objects with datatypes. Objects encapsulate both data and methods. Adding new datatypes to the RDBMS is only one step toward supporting true objects. Also required is the ability to define new methods and to associate these methods with the appropriate datatypes.
The second point is the need to differentiate the object-relational approach from that of object DBMSs. Object-relational DBMSs deliver some object capabilities, but they do not yet offer the same level of support as object DBMSs for features such as encapsulation and inheritance. In addition, it is unlikely that object-relational DBMSs will fully support features such as pointer navigation and tight integration with object-oriented programming languages to provide persistent storage of native objects created in an application. I expect object-relational products to move in this direction to gain the benefits of the object approach where possible, but object features will be implemented differently in object-relational, given its underlying data structure of tables and columns.
There are two key differences among these approaches. The first is how and where the data is managed. Is all of the data tightly integrated and managed by a single DBMS server? Or is the data loosely integrated and managed by multiple servers? The second differentiator is where query optimization occurs and how well it performs. Query optimization maps the logical view of data to the physical view and decides the best way to execute the query. Is this handled by the database server (the universal server approach) or in middleware (the OLE DB approach)? Good optimization is particularly important in an environment in which data can be stored in any combination of one table, multiple tables, multiple databases, and/or files outside the DBMS.
An "extended universal server" accommodates the fact that there may be very good reasons (such as performance) for not storing all data in the DBMS. So the DBMS must also be able to efficiently access data stored in external files. Large data values -- images, for example -- can be stored externally, and a pointer to each image file is stored inside the database as a column value. An additional step is enabling the DBMS to also manage and ensure the integrity of this external data. The only RDBMS vendor planning to address the latter is IBM, with its robust file-links. However, I expect efforts in this area to receive much more attention in the industry over the next 12 months.
Object purists already criticize the extended-relational approach because of the fact that the DBMS must decompose objects into relational tables -- rows and columns -- for storage and then rebuild them into objects before delivery to the user. The vendors extending relational are very much aware of the performance issues and the need to avoid the overhead of joins where possible. My philosophy here is the following: If product performance meets user requirements, it doesn't matter what the product does under the covers (unless, of course, the architecture hampers the ability to add future enhancements). In the real world, user experience will demonstrate whether the extended RDBMS vendors have been successful in adapting the relational model for complex-data support. I discuss specific universal server extensions in more detail later in this article.
Microsoft's OLE DB and DCOM, and other object request brokers (ORBs), on the other hand, represent another type of middleware: application middleware. OLE DB is an interface that was designed to provide universal access to data. It "componentizes" DBMS functionality, breaking it up into components that can run in the middleware space or in the operating system, such as query processors, optimizers, and transaction managers. OLE DB will be an integral component of Microsoft's operating systems and servers. In fact, DBMSs were originally developed because of operating-system deficiencies. If these services are implemented at the operating-system level, will we still need the DBMS? This discussion will become particularly interesting if SQL3 doesn't prove that it is up to the task of handling real objects. Oracle's Network Computing Architecture also addresses the ability to extend the data environment at the middleware level; Oracle's Web Application Server will evolve into a generic application server that can provide some database functionality such as transaction management.
How well a middleware approach performs will be determined by several factors: the level of integration among components, how smart the middleware is about data stored in supported servers and the native capabilities of each server, whether the user wants to manipulate the data or simply retrieve it, and the mechanism by which components communicate. Some RDBMS vendors claim that using RPCs to execute queries across distributed data will have serious performance implications. However, the comment I just made about performance as it relates to object-relational also applies here.
One issue with OLE DB in particular and ORBs in general is the ability of third-party software vendors to provide competent database functionality such as global query processing and optimization. Given the amount of effort and R&D investment that have gone into optimization algorithms on the part of the major RDBMS vendors, it is not clear that middleware will offer comparable functionality and performance. Another issue is the number of data sources accessible through middleware. Is the user dependent on the DBMS vendor, or can a customer or a third party integrate data to meet specific business requirements?
Extensible Type System. An extended RDBMS must support user-defined datatypes (UDTs) at both the column and the row level. Column-level UDTs are either distinct or abstract datatypes. UDTs enable the extended RDBMS to incorporate new datatypes and understand complex data or business relationships. Distinct types are relatively simple UDTs that extend an existing base datatype for a column. A strongly typed system will not permit the user to make inappropriate direct comparisons between types with different names, even though they share the same base datatype and length. Abstract datatypes define more complex datatypes that have special internal structures and attributes, such as text, geospatial, or time-series data. As with objects, the internal structure of an abstract datatype is hidden from the user; data is accessed and manipulated using a set of external attributes and functions. Abstract datatypes are defined using SQL (the database engine is aware of the attributes and internal structure) or a host language (the type is essentially "opaque" and seen only as a large object by the DBMS).
A row type describes an entire row or a set of nested columns in a table, providing a way to represent hierarchical "entities" in the database -- customer, employee, and so on -- and identify multiple related columns. Reference types can then define relationships between row types and uniquely identify a row within an entire database. References enable users to replace complex-join definitions in queries with much simpler path expressions. References also give the optimizer an alternative way to navigate data instead of via value-based joins.
Collections are type constructors that are used to define collections of other types, such as arrays, lists, and sets. Collections are used to store multiple values in a single column in a table and can result in nested tables where a column in one table actually contains another table. The result can be a single table that represents multiple master-detail levels. Collections add flexibility to the design of database structures.
An important aspect of object modeling is inheritance, in which subtypes inherit the attributes and behavior of their supertypes. Inheritance facilitates code reuse and the ability to maintain logical integrity in the database.
User-Defined Functions. User-defined functions (UDFs) define methods for manipulating data and are an important adjunct to UDTs. An extended RDBMS should provide significant flexibility in this area, such as allowing UDFs to return complex values that can then be further manipulated (such as tables), execution options so that the user can decide whether performance or security is more important when running UDFs, and support for overloading of function names to simplify application development.
Index Structures. Traditional RDBMSs use B-tree (binary tree) indexes to speed access to scalar data. With the ability to define more complex datatypes in the RDBMS, specialized index structures are required for efficient access to data. Some extended RDBMSs are beginning to support additional index types, such as R-trees (region trees) for fast access to two- and three-dimensional data, and the ability to index on the output of a function. A mechanism to plug in any user-defined index structure provides the highest level of flexibility.
Optimizer. The query optimizer is the heart of RDBMS performance and must also be extended with knowledge about how to execute UDFs efficiently, take advantage of new index structures, transform queries in new ways, and navigate among data using references. Successfully opening up such a critical and highly tuned DBMS component and educating third parties about optimization techniques is a major challenge for DBMS vendors.
Other Extensions. Other important extensions are support for large-object storage either inside the database or outside in external files, the ability to apply business rules and integrity constraints to new datatypes, recursive queries to support complex-data relationships, and extended language support in the server. This last area is key to improving both flexibility and portability. Extended RDBMSs must support the SQL3 standard (in committee draft status now) plus additional languages for writing UDFs and stored procedures, such as 3GLs and Java. Unfortunately, the SQL3 standard does not address some areas of extensibility, so implementation of features such as the mechanism for giving the optimizer cost information about UDFs and new index structures will vary among products. The lack of a standard way for third-party vendors to integrate their software with multiple RDBMSs demonstrates the need for standards beyond the focus of SQL3.
We also need application language extensions to "complex-data-enable" applications on the front end. Application development tools must be extended to take advantage of new server functionality.
The second obstacle is convincing customers that to get extensibility, they don't have to give up what they already have, such as good performance on existing applications. A question to ask each vendor is how integrated complex-data extensions are with existing DBMS functionality such as parallel processing, backup and recovery, data integrity constraints, and data replication. Does the customer have to choose between complex-data support and the ability to use other features such as parallel processing, data replication, or distributed databases? In some cases, the answer is yes.


| TABLE 1. Summary of Object-Relational Extensions | |
| Feature | Included in SQL3 |
|---|---|
| Extensible type system (user-defined types or UDTs) | Yes |
| Support for strong typing | Yes |
| Support for hierarchies of types and inheritance | Yes |
| Data replication support for UDTs | No |
| User-defined functions (UDFs) | Yes |
| Function overloading | Yes |
| Function resolution based on multiple attributes | Yes |
| Extensible indexing system | No |
| Extensible query optimizer | No |
| Support for large objects (LOBs) | Yes |
| Support for external data | No |
| Integrated searchable content | Yes |
| Extended language support | Yes |
| SQL3 and SQL/Multimedia | Yes |
| 3GLs | Yes for stored procedures |
| 4GLs | No |
| Java | No |
| Object-oriented languages | No |
| Predefined extensions available | N/A |
| Facilities (API, developer's kit) for adding extensions | No |
| Application language support for extensions | No |
| Systems management support for extensions | No |