Managing massive databases requires specialized strategies and techniques.
In part 1 of this article (see "Taming Data Giants," Part 1, DBMS, February 1997, page 38), we discussed several factors that contribute to the galloping growth of very large database (VLDB) implementations. In this article, we review critical VLDB administration issues such as performance management and high availability. We also explore the role of indexing strategies, query optimizers, parallel processing, scalability, and three-tier architectures as strategies you can use to effectively manage VLDB implementations. Finally, we describe how several leading RDBMSs support very large databases.
Key-range partitioning will distribute data rows across the database partitions according to a pre-defined key range associated to each partition. Some databases will require that the key-range partitioning be performed only on the (unique) primary key for a table. Database maintenance can be dramatically simplified using key-range partitioning. Reorganizations and index rebuilds can usually be undertaken on a partition-by-partition basis to help manage the scope and impact of such tasks. Key-range partitioning is often performed on large tables using a date timestamp column as the partitioning key. This has the built-in advantage of providing a simple way to age data out of the database by dropping the oldest partition out of a table to make space for incoming data for the next time period to be stored. This offers a straightforward method of keeping a rolling number of time periods of data without requiring burdensome database maintenance.
The alternative, without the key-range partitioning, is to issue a mass delete operation using a where clause to qualify rows by date range that are to be purged from the table. If a large number of rows is to be deleted, the operation will often need to be broken out into many steps (sometimes one day of data at a time between commit points) to avoid blowing up the database because of a lack of log space. Notice that in hash partitioning, even if the date timestamp is the partitioning key, the randomization of data rows across the partitions makes it impossible to isolate the rows to be deleted to a single partition for purging purposes; a mass delete operation is the only choice in such implementations. Finally, smart database engines often can avoid scanning or indexing into large portions of a key-range partitioned database if the partitioning key is contained in the where clause or as a join key in a SQL query. The partition elimination is performed by recognizing the mutually exclusive data content for the partitioning column across partitions.
Round-robin partitioning is the simplest model of data placement of the three; rows are merely placed into partitions as if the database were dealing cards to players on a rotating basis. The main advantage of the round-robin technique is that the data is distributed very efficiently when performing high-volume loading or insert-selects into a large table. With simplicity comes speed, and the round-robin technique can distribute rows into a target table approximately five percent faster than the hashing method of partitioning.
Most DSS-oriented DBMS implementations make use of hash partitioning to ensure an even distribution of data across partitions and to provide co-location of joins when two tables are partitioned on the same key. Co-location of a join occurs when the join column in an equijoin between two tables uses columns that also represent the partitioning keys for the two tables. Co-location allows complete locality within each database instance when invoking a parallel join on partitioned tables, because the hashing function will ensure that any two rows hashed on the same join key must coexist in the same database partition. By definition, two tables cannot be stored in different partitions if they are co-located. The co-located join therefore avoids any need for communication of data between partitions across the database instances managing these partitions.
In a shared-nothing DBMS implementation where data partitions are explicitly assigned to DBMS instances, with function shipping as the means for passing data between instances, co-located joins can be executed significantly faster than joins that are not co-located. If a join is not co-located (the join key between the tables to be joined is not the partitioning columns for one or both tables), then an on-the-fly redistribution of data across the database instances must be undertaken to align the two tables for execution of the join. The cost of the redistribution can vary widely across database products and will depend heavily upon the size of the table(s) to be redistributed and how many rows can be filtered locally based on application of where clause predicates. In databases, such as Informix Software Inc.'s XPS, where communication between database instances is very efficient, the performance penalty for joins that are not co-located can be quite small (the typical penalty for a join that is not co-located in Informix XPS is less than 10 percent compared to a co-located join). On the other hand, shared-nothing database implementations that have not optimized for the cost of full table redistributions will often pay a factor of two or more in performance penalty in comparison to co-located joins.
As we mentioned previously, most DSS-oriented database products make use of hash partitioning techniques for their VLDB implementations. NCR Corp.'s Teradata, IBM Corp.'s DB2/6000 Parallel Edition, and Sybase Inc.'s MPP all fall into this category. Tandem Computers Inc.'s Non-Stop SQL and IBM's DB2 V4 on MVS, both of which have good implementations for high-end OLTP workloads, make use of key-range partitioning for large tables. The Informix database engine allows any of the three partitioning options for its high-end implementations. The Oracle database engine will not support partitioning until its 8.0 release, but it offers some clever workarounds in version 7.3 for the interim. A simulated partitioning is achieved by constructing separate physical tables for each desired partition of what would logically be considered a single table. To present a single table image to application programs, a view is constructed with union all operators between each of the physical tables to provide the illusion of one large table. In previous releases of the database, the cost of view materialization would have been prohibitive for any type of SQL query launched against such a construct.
However, Oracle Corp.'s Oracle 7.3 has introduced some optimizer intelligence and syntactic extensions to allow for relatively efficient handling of access to the unified tables underneath the view construct. In this partitioning scheme using physical tables to implement the partitions, the separate tables typically divide the data rows according to key-range values as the basis for partitioning. In Oracle 7.3, the view construct allows a syntax in which the key-range values associated to a table can be specified as part of the view definition (the specification is not enforced, however, and therefore is more of a hint to the optimizer than an actual view constraint). The optimizer uses the specification in such a way that any access to the data in which the partitioning key is used can be made significantly more efficient by eliminating consideration of any table in the UNION ALL view that does not contain any rows with the desired key value(s). Those tables that potentially have qualifying rows must be materialized into the view, but as many WHERE clause predicates as possible are pushed down into the filters against individual tables to eliminate as many rows as possible before bringing them into the view materialization. Oracle can exploit parallelism against even a single partition (table) of the UNION ALL view using parallel query servers in a shared-everything database architecture.
Although single-instance database implementations on SMP architectures continue to deliver better performance by removing serialization points and implementing more efficient threading mechanisms, multiple database instances continue to have limited success in scaling at the high end on clustered or MPP machine architectures for OLTP workloads. In a multiple-instance database implementation, a single-image database is operated upon by multiple, cooperating database processes. In contrast to a distributed database implementation, in which each database instance is relatively independent, the implementation of multiple database instances in Oracle's Parallel Server environment or Informix XPS has the instances sharing a single view of the database image, rather than separate islands of data.
Within the space of multiple-instance database implementations against a single database image, vendors can choose to deploy either shared-everything or shared-nothing database architectures. A shared-everything database architecture enables all database instances to address and perform I/O against any data block in the database; partitioning is independent of the multiple database instances -- although some implementations may permit an affinity to take place between database instances and table partitions to influence which instances access specific database partitions. In a shared-nothing database implementation, database instances are explicitly assigned to table partitions. I/O to data blocks within a partition must be performed by the database instance assigned to the partition. Any database instance that requires data managed by another database instance uses a function-shipping model to request database objects (not data blocks) to be delivered from the owning instance. Filtering of the data blocks to eliminate unneeded rows and columns is performed locally in the owning instance before shipping result sets back to the requesting instance.
Depending on the implementation, flushing a data block from a database instance will often require writing out the block to disk to ensure recoverability before passing the block for further updates to another instance. In a very high-volume update environment, as is often typified by OLTP applications, this database buffer cache coherence mechanism will commonly result in contention over frequently updated blocks (especially index block headers) in the database, resulting in massive performance degradation as instances struggle to obtain access and maintain coherency for high-demand data blocks. This effect is affectionately referred to as the block-ping problem among those who are frequently called upon to battle this problem. The impact is severe because those blocks that are updated most frequently are penalized the most in the block-ping scenario. Instead of accessing these high-demand blocks at memory speeds in a database buffer cache, access is gated though lock manager requests and disk I/Os required to flush the block from the buffer cache of one instance before bringing it into the requesting instance's buffer cache. The difference between memory access to a data block and disk I/O access is three orders of magnitude -- quite a price to pay in an OLTP environment for access to the most frequently updated blocks.
Transaction Processing (TP) monitors provide a robust middleware layer to insulate the database engine from these issues. Nearly all successful OLTP applications at the high end of the performance spectrum will be implemented using some form of TP monitor. TP monitor products such as BEA Systems Inc.'s (Sunnyvale, Calif.) Tuxedo and Transarc Corp.'s (Pittsburgh, P.A.) Encina (Transarc is now owned by IBM) dominate market share in the Unix environments, whereas CICS dominates in the MVS marketplace (with a small amount of Unix penetration). The advantages of a robust middleware architecture are many. First of all, business applications logic is explicitly separated from the presentation and database access layers of the application to yield a much cleaner modularization of the systems software. Allowing encapsulation of applications objects in the middleware layer in turn allows a level of reusability and flexibility of deployment that is not possible when presentation and business logic are intermingled. The encapsulation of applications objects in the middleware also supports deployment using a thin-client approach rather than forcing monstrous desktops -- which are usually required with two-tier application implementations. When implemented properly, the middleware architecture is particularly well suited for Web-based deployment, because it separates presentation and business logic.
The block-ping issue is addressed explicitly using the data-dependent routing (DDR) features provided in most TP monitors to allow for intelligent routing of transactions to specific database instances. This approach maximizes instance locality in reference patterns so as to avoid contention over frequently accessed blocks. The three-tier architecture support permits a much higher degree of applications scalability because the TP monitor middleware will manage connectivity to the database much more effectively than a two-tier model. A two-tier implementation maintains one database connection for each desktop, but the three-tier model lets the TP monitor manage the desktop connectivity to a scalable set of application servers. The number of database connections that are instantiated will be the number of concurrently executing transactions to be supported (typically, at least a factor of 10 fewer database connections than in a two-tier implementation).
The three-tier model with a middleware architecture provided by the framework of a TP monitor also allows for much better availability characteristics than a two-tier implementation. A good TP monitor implementation can provide high availability via automatic failover if a database instance or application server experiences an outage.
However, the local indexing approach also leads to scalability problems for high-volume OLTP in a shared-nothing database environment. Almost all OLTP transactions will be executed using very selective indexed access into database tables (the purpose of doing so is to deliver acceptable response times). The problem in shared-nothing database implementations with local indexing arises when a transaction is issued for indexed access against a partitioned table where the index column is not the column that was used for partitioning the table. In such a case, the database has no way of knowing which partition will contain the desired row or rows -- and so the transaction is broadcast-issued to all partitions simultaneously. Each partition can then use its local index structure to efficiently retrieve matching rows or to determine that no row with the desired qualification exists.
The problem with the broadcast issue of a transaction is that this approach is fundamentally nonscalable for high-volume OLTP in a multi-instance database implementation. Scalability implies that by adding more hardware resources and additional database instances, we should be able to increase transaction volumes and throughput in a high-end OLTP system. However, if all database instances must participate in all transaction executions because of the broadcast issue of transactions, we have lost all semblance of scalability. Even though the execution of a transaction may be efficient using local indices within the domain of each database instance, once any instance in the system reaches its saturation point, there is no way to increase transaction volumes by adding more instances; any increase in transaction volume must be handled by all instances (because all transactions are broadcast to all instances).
Note that if the desired index path happens to coincide with the partitioning column for a table (in either hash or key-range partitioning schemes) then a transaction does not necessarily need to be broadcast to all database instances. If the access path is based upon the partitioning column for a table, then there can be only one table partition in which the desired row(s) would be located. Thus the transaction could be properly executed without necessitating a broadcast issue by routing it only to the database instance corresponding to the table partition to which any matching rows must be present (if any such rows exist). Scalability for transaction executions that make use of the partitioning key for table access is quite good because broadcasts can be avoided completely. However, it is very difficult to come up with a database design that accommodates this approach for complex applications deployments. The problem is that many tables will need to support more than one indexed access path in real-world applications.
For example, an account table may very effectively be partitioned by an account number (typically a primary key) to allow scalable OLTP workloads to be implemented when retrieving and updating accounts based on an account number access with efficient local indexing. On the other hand, accounts will very often be accessed via customer number as a foreign key with very high selectivity when accessed via an index. Unfortunately, however, the customer number is not the partitioning key on the table, so the transaction will need to be broadcast to all instances in a multi-instance shared-nothing database implementation for most of the products currently in the marketplace. It is impossible to partition the account table on both the account number and customer number simultaneously, so one of the access paths will result in a fundamentally nonscalable transaction workload implementation. General-purpose workloads, with many access paths, make it impossible to partition tables so that all indexed access will be implemented with alignment to partitioning columns for a table.
The problem in the scalability of OLTP workloads described above is a direct result of the local indexing implementation undertaken by most shared-nothing database products. The local indexing approach works very well for most decision-support workloads because indices don't have nearly the selectivity in these environments -- and therefore the penalty for broadcasting queries to all instances isn't nearly so great as it is in an OLTP environment. In fact, if most instances are going to return at least one row anyway (as will usually be the case in DSS workloads), then a broadcast strategy is quite appropriate. For OLTP transactions with selective indexed access, the performance cost of broadcasting to all instances (in overall throughput) will be roughly proportional to the total number of instances in the database configuration. It is for this reason that most shared-nothing database implementations are focused on the DSS marketplace rather than on OLTP. Among the multi-instance implementations of a shared-nothing database architecture, only Tandem's Non-Stop SQL is able to demonstrate good scalability with high-volume OLTP workloads.
Because the global index will be partitioned according to the value of the index columns, access via the index can be provided with explicit routing of the request to the database instance that manages the partition of the index corresponding to the requested value. The global index structure completely avoids the need to broadcast the transaction request to all database instances in the multi-instance implementation. (Note that in the case of an index range scan, multiple instances may be involved in the index access.) The instance that receives the request can efficiently find all database records matching the desired value through use of its partition of the index structure. The matching database records may then be retrieved from whichever database instances happen to have management ownership of the desired records. The partitioned index structure must possess the capability to manage pointers to database records that cross instance boundaries. These index pointers may be physical data block pointers that specify the instance to be targeted along with a data block address within the instance. Alternatively, a logical pointer can be stored that contains the primary key value of the record to be accessed (this technique works best when table partitioning is required to be on a table's primary key). Vendors must consider a variety of performance and maintenance tradeoffs when selecting implementation approaches to pursue within a database product.
There is, of course, an overhead for accessing an index in one instance and then going to a separate instance for the targeted database record. Yet, despite the incremental overhead, this approach is more scalable. As more transaction throughput is required, the architecture allows for the configuration of additional database instances with partitioning on both tables and indices in such a way that no bottlenecks are introduced, either in the form of broadcast transactions or as single points of contention. Only when accessing a table using indices that are not very selective will the cost of the multiple instance coordination for index and record access outweigh the cost of broadcast transactions in a VLDB implementation that has many database instances. In the case of a lack in primary-key type of selectivity for an index -- as is typically found in DSS workloads -- the use of local indexing will often be desirable.
In the future, we believe that both local and global indices will become available for high-end database products. It is likely that DBAs will initially be required to specify what type of structure (local or global) is to be constructed for the columns to be indexed. In the future, it is likely that the DBMS engine itself could determine the most appropriate structure based on the content of the columns to be indexed. Todd Walter, Chief Technical Officer for the Teradata database at NCR, emphasizes that any global indexing scheme must be partitioned in order to avoid single points of contention. As the Teradata database evolves to support a global indexing capability, Walter says that particular attention will be paid to ensuring efficiency and scalability in index maintenance functions for VLDB tables.
Over the last few years, especially as the market trend for data warehousing has really caught on, database vendors have put a great deal of focus on optimizer development. Oracle introduced its cost-based optimizer in Oracle 7.0 and is phasing out the use of rule-based optimization in its product. Gary Hallmark, consulting member of technical staff who focuses primarily on data warehouse issues at Oracle, sees the cost-based optimizer as a key component of Oracle's data warehouse solution with specific features that allow for exploitation of star schema designs, bitmapped indices, and parallel-processing capabilities available with the Oracle Parallel Query option. Hallmark emphasizes the importance of a number of optimization improvements introduced with Oracle 7.3, including "parallel aware'' optimization strategies that enable the optimizer to use parallelism in its execution plans to trade system throughput for better response times on individual queries. Oracle 7.3 also introduced histogrammed statistics for use by the optimizer for better plan generation when skewed data is present within large database tables. The quality of plans generated by Oracle's cost-based optimizer have increased dramatically between Oracle 7.0 and the current 7.3 version of the product, and even further enhancements are expected with the release of Oracle8 later this year.
IBM expects to integrate its advanced research in cost optimization strategies coming out of the Starburst project (undertaken in IBM's Almaden research laboratories in California) into both the MVS and Unix implementations of the DB2 database product. Gilles Fecteau, architect of DB2/6000 Parallel Edition, indicates that customers for this high-end database focused on DSS workloads should see the benefits of the Starburst research in the Common Server implementation of DB2 as early as the first half of 1997. Because the research prototype for IBM's optimizer technology was developed in C++, it will be a much faster path for integrating the advancements into DB2's Common Server database engine (which is also implemented in C++) than will be the case for DB2 on MVS (which is implemented in assembler and PL/X-390).
The Informix XPS database implementation has set a very high performance bar for its competitors with the efficient implementation of a hash join plan that has been made available to the optimizer with pipelining to facilitate parallel execution along multiple dimensions. Oracle has since introduced hash join in its 7.3 database release, and other vendors who have not already implemented hash joins are likely to do so soon. Although hash joins are not relevant to OLTP workloads, for decision support this strategy can deliver remarkable performance advantage over sort-merge joins. The gold standard that all cost-based optimizers are striving for in executing DSS workloads is relative to the Teradata optimizer. The Teradata database implemented a cost-based optimizer totally focused on DSS workloads from day one; it has over 10 years of maturity under its belt directed at this marketplace.
In the big picture, it is generally much more effective to rely on good optimizer strategies than to use the brute-force approach of throwing hardware at a performance problem. With the widespread use of parallel hardware and database platforms, there is often a temptation to use parallel hardware scalability to "hide" performance problems in a data warehouse implementation. However, this solution path can be very expensive. Doubling a hardware configuration may be able to cut response times in half, but good optimizer choices will often result in order-of-magnitude performance benefits. Red Brick Systems Inc.'s DBMS, for example, completely de-emphasizes parallelism in its deployment. Red Brick uses optimization strategies specifically targeted at star schema database designs that are well suited for certain classes of DSS workloads. By making clever use of indexing technologies and optimizing for specific DSS query structures, Red Brick is able to deliver order-of-magnitude better performance characteristics for many decision-support workloads than more traditional relational implementations. The Sybase IQ product has taken a similar approach, with optimization for DSS workloads using a combination of bit indexing and clever table partitioning to deliver high performance. However, for very large data warehouse implementations, Sybase typically recommends deployment of its Sybase MPP product in order to receive the benefits of scalable parallel processing.
The reality is that most data warehouse implementations have not yet run into the technical challenges related to deployment of appropriate optimization strategies for end-user workloads. The critical issues early on in these projects, in fact, have very little to do with database technology per se. The major obstacles have more to do with understanding business requirements and managing end-user expectations than any particular technology issue (in contrast to OLTP implementations, which are primarily focused on the technical challenges related to high availability and performance). Second, data quality is a huge issue in data warehouse implementations because end users are accessing different parts of the data than is typically focused on in OLTP systems. The VLDB marketplace for data warehouse systems still has a lot of maturation ahead of it.