Getting a Grip on VLDB in the Real World.
Very large database implementations continue to capture the attention of the industry by virtue of their inherent technical challenges and complexity. One common question is, "How big does a database have to be before it is considered a VLDB implementation?" Big, however, is an ambiguous word. Big in terms of what? The first metric that usually comes to mind is to define the "bigness" of a database in terms of its size. Oracle is increasing its focus on the VLDB marketplace, and other databases, such as Teradata, boast reference sites in excess of a terabyte in database size.
You must be quite careful when assessing the database size metric across VLDB implementations. What you include in the database size can be critical in distinguishing among VLDB implementations. Typically there is a huge difference between raw data size, total database size, total usable disk-space size, and total disk-space size (including media protection, such as mirroring or RAID5). (Table 1 lists the commonly used data-storage measurement terms and their approximate size.) For example, at one VLDB site that Strategic Technologies & Systems recently worked with in its DB2/6000 Parallel Edition implementation, a 300GB raw data requirement resulted in a total database size of approximately 950GB once we considered the indices, multidimensional summary tables, result set tables, relational database overhead, sort space, and log space. In addition, working space in excess of 150GB was required for staging, sorting, transforming, and integrating data from multiple mainframe sources prior to loading into the data warehouse. This file-system space was recycled for use by many concurrent SAS analysts for SAS dataset creation when data preparation was not taking place. By the time we performed mirroring for high-availability purposes, the total disk size of the data warehouse platform was well in excess of two terabytes.
For VLDB purposes, I will focus on the database portion of such implementations (excluding working space for staging, external data sets, and so on). Even within this scope, however, the ratio of raw data size to total database size can vary widely across database products and application implementations. Raw data to disk space ratios in recently published TPC-D results indicate that you can expect factors ranging from three to six in raw data size to total database size for most high-end database products targeted for VLDB warehouse implementations. The TPC-D benchmark provides a well-defined set of query and data maintenance functions designed to model a complex decision-support environment. Even when the application is held constant (TPC-D has predefined application characteristics), the implementation of a workload with varying database products can have a large impact on total database size. (See Table 2.) In many cases, the total disk size may be in excess of the storage requirements just to get enough spindles into the configuration to yield desirable performance.
Size alone does not constitute a thoroughly remarkable VLDB implementation. After all, anyone can shove billions of rows into a database. What you do with the data is what really counts. In sizing up the horsepower requirements for a VLDB implementation, you must consider more than just size. My experience consistently points toward three primary factors for distinguishing VLDB implementations: database size, number of concurrently executing transactions or queries, and complexity of transaction or query workload. Because many decision-support systems employ full table scans for significant parts of their workloads, database size is relevant for measuring the workload because scan time relates directly to the size of the tables being scanned. The number of concurrently executing queries in a DSS tends to be one to two times lower than the concurrently executing transactions in a similarly high-end OLTP system. Because OLTP systems typically use selective, indexed access into the database, size is not nearly as relevant a measure of OLTP workload as the number of concurrently executing transactions. Complexity of a transaction or query workload is a huge factor in sizing up the horsepower for either type of system.
In terms of raw data size, there are still only a few database implementations in production in the terabytes range. According to the 1996 VLDB Survey conducted by Winter Corp. (Boston, Mass., see www.wintercorp.com), only the top three OLTP databases and the top four DSS databases had a terabyte or more of raw data in a single database image. Of course, I'm sure that other databases above one terabyte did not respond to the survey for reasons of competitive or defense-related secrecy. However, the landscape of VLDB implementations clearly thins above the one-terabyte level. Wal-Mart continues to lead the pack with its 2TB (raw data) Teradata implementation on the NCR 5100 (containing over 20 billion rows in its largest single table). Several databases in both the OLTP and DSS space have approximately five billion rows (typically in retail, telecommunication, and transportation industries). The number of concurrently executing transactions sustained during peak workload periods in a high-end OLTP system is typically somewhere in the (many) hundreds. These transactions are relatively complex compared to TPC-C transaction definitions. For simple transactions (characterized by the now-defunct TPC-B benchmark), it would not be unusual to see thousands of transactions per second executing on a high-end configuration.
Decision-support databases also must be able to re-create point-in-time snapshots of the business in order to perform as-of analysis for purposes ranging from retrospective pricing analysis to construction of direct-mail response models and credit scorecards. The ability to create point-in-time snapshots requires date segmentation for all major data elements that are volatile over time. The operational systems usually deal with today's view of the data and therefore are not forced to encounter the complexities and storage overhead of date segmentation. Finally, sophisticated decision-support analysis will often require data from sources that are external to the operational systems in order to capture competitive information, market conditions, demographic information about customers, prospect lists with demographics, and so on. The combination of greater historical depth, point-in-time snapshot requirements, and integration of external data will typically drive the size of a decision-support database to a factor of two or three times the size of the operational databases for supporting an organization's data-warehouse capabilities.
The complexity of transactions and queries executed against very large databases varies significantly from site to site and even within a single site. For example, in the OLTP implementation of a VLDB at a large package-shipment company, a simple query may involve looking up a single record to describe the current status and location of a package via selective index access using a shipment number. Within the same system, a concurrently executing transaction may be retrieving all of the package records on inbound trucks to a particular shipping hub, as well as the customer information for those packages, to feed into a complex scheduling algorithm to determine whether or not an outbound truck should be released to its next destination. All transactions must be executed with sub-second response times to maintain workforce productivity and customer satisfaction.
Decision-support queries have a similar wide variance in complexity. Most complex DSS workloads do not consist of single-query executions. Typical applications in a data warehouse environment might involve thousands of lines of SQL to calculate profitability to the customer level using activity-based costing and revenue metrics or to produce at-risk scores for targeting customers for retention programs. It would not be at all unusual to observe as many as 100 SQL statements in such implementations. Each SQL statement selects large portions of the database to use in joins and aggregates across potentially billions of rows, usually calculating metrics from transactional behavior up to the customer level for tens of millions of customers. Most DSS usage will create several intermediate tables in the course of cutting and slicing data iteratively to arrive at the desired answer to an end-user query. One of the largest fallacies in the industry is that DSS queries involve only small answer sets resulting from complex joins. Yes, the joins are very complex, but answer sets are frequently millions of rows in size.
Efficiency in creating large intermediate result sets and extracting these result sets to analytic tools or target lists is one of the biggest shortcomings in many of the parallel databases on the marketplace today. Certainly many end-user questions can be answered with a single SQL statement, but the vast majority of DSS cycles are spent on applications with many SQL statements operating on successive refinements of large data sets through a series of filters and aggregates with other data in a data-warehouse database.
Capturing customer identity at the point of purchase gives retailers even more analysis capabilities. The inevitability of electronic purchasing as the preferred vehicle at most retailers (through use of smart cards and other devices) facilitates customer identification as a direct outcome of this trend. Additional sourcing of customer demographics and preferences gives retailers a powerful marketing and customer-service tool. Blockbuster Entertainment already identifies its customers at the point of purchase in its video stores and is looking forward to doing the same in its music outlets. Although the volume of data associated to all customers, purchases, and line items in a large retail chain may seem overwhelming in terms of the multi-terabyte database requirements, it is the intelligent use of this information that lets the likes of Wal-Mart and Blockbuster expand domestically and internationally.
Retail is not the only industry experiencing a trend toward increasingly large VLDB implementations. The emphasis in the marketplace on superior customer service has implications across the board. For example, in package shipping (an industry where many of the largest databases exist today), the granularity at which tracking is now being performed lets customers see exactly where a package is at any point in time during its transit. When a shipment consists of multiple packages, tracking is performed at both the shipment and individual package level to ensure accuracy in understanding when packages within a shipment get separated or misrouted. These package tracking databases were already among the largest in the world when they contained one record for each package along with its current shipping status. Now, consider the movement toward retaining scan-level data records for each time a package changes hands in the shipping process. A package may experience a dozen or more scans between its origination and destination points. The largest databases in the world are on their way to being 12 times larger. Capture of the scan-level data records lets you manage each step of service levels much more proactively in a business that ships millions upon millions of packages per day.
The deployment of automated kiosks represents only a small preview of what will be possible with large-scale adoption of Internet technology for accessing product information. Today, most Web sites on the Internet track the identity of the (potential) customer accessing the site. Web server logs record each page that a user accesses. Sophisticated log-analysis programs can track the user's movement within a Web site.
The ability to track information at a granular level regarding customer interactions with a well-designed Web site presents unprecedented opportunities for building up customer understanding on an individual basis, because the penetration of Internet "shopping" is predicted to grow exponentially over the next few years. The implications of tracking interactions with Web sites for consumer access is analogous to capturing a database of all the products that a customer examines in a store -- even the ones that aren't purchased. Moreover, in a well-designed Web site, it should be possible to infer what dimensions of the product were of interest and what products were compared before a purchase decision was made. Of course, this makes for a very large database to capture and analyze all of the information that is acquired under this new paradigm.
Dan Jaye, CTO at CMG Direct Interactive (based in Andover, Mass.), warns that you must be selective in what to capture from the Internet to avoid being overwhelmed by data (as opposed to information). A typical retailer would stock 100,000 SKUs in a large store; Jaye's team has over 70 million Web pages cataloged for tracking. Issues such as time relevance and statistical significance in analyzing sparse patterns of usage become critical when mining data of such proportion. CMG Direct Interactive currently has 1.6 terabytes of DASD on the floor managed by an Informix XPS database for CMG's initial prototype, and the company will be scaling up to a multi-terabyte implementation for the production deployment. The design point for CMG's system is to handle data accumulation at upwards of 100,000 Internet transactions per day.
Another significant factor in the growth of very large databases is the onslaught of object database technology as a means of expanding the types of information that can be effectively accessed and stored within the domain of a VLDB implementation. The merger of Illustra and Informix in the Universal Server product, as well as object features that will be made available in Oracle 8, clearly set the groundwork for integration of multimedia objects with data stored in a traditional relational framework. NCR Corp.'s Teradata database has also been extended with multimedia object capability in the research prototype delivered as part of the Moonbase project in cooperation with the National Institute of Health. NCR has announced that multimedia object capability and content-search functionality resulting from this project will become available in the Teradata database sometime in 1997, and a number of beta sites are currently being identified.
Many challenges on the road to online medicine make this a difficult vision to bring to fruition. Standardization across healthcare providers for critical information is much less than ideal; even the integration of medical, mental health, pharmacy, and other data is a big undertaking for many health care providers because of the silo systems that exist in most large organizations. Make no mistake, however: The big push is on, and healthcare is likely to emerge as an industry that will drive some of the largest databases in the world.
The issues of performance management, backup and recovery, and high availability take on new meanings when databases reach VLDB status. The importance of indexing strategies and database optimizer capability are two key areas that become critical when focusing on performance management for high-end applications systems. I will examine both areas in the next installment of this article, which will appear in an upcoming issue of DBMS. I will also explore parallel processing and three-tier architectures as they relate to VLDB technology. With all of the innovation now taking place at the high end of the marketplace, 1997 is rapidly evolving to be a landmark year for the introduction of VLDB enabling technologies.
Owen Robertson is a senior DBA at Tanning Technology Corp. You can email Owen at oroberts@tanning.com.
TABLE 1. Data-storage measurement terms and their approximate size. | |
| Term | Approximate Size |
|---|---|
| byte | 8 bits |
| kilobyte | 1,000 bytes |
| megabyte | 1 million bytes |
| gigabyte | 1 billion bytes or 1,000 megabytes |
| terabyte | 1 trillion bytes or 1,000 gigabytes |
| petabyte | 1 quadrillion bytes or 1,000 terabytes |
TABLE 2. TPC-D Chart. | ||||
| Hardware | Database | Raw Data | Total Disk | Ratio |
|---|---|---|---|---|
| Digital 8400 | Oracle | 100GB | 361GB | 3.61 |
| HP 9000 | Oracle | 100GB | 643GB | 6.43 |
| IBM SP2 | DB2/6000 PE | 100GB | 377GB | 3.77 |
| NCR 5100 | Teradata | 100GB | 880GB | 8.80 |
| Sun UE | Oracle | 100GB | 594GB | 5.94 |
| Tandem K10000 | Non-stop SQL | 100GB | 286GB | 2.86 |
| IBM SP2 | Oracle | 300GB | 1,977GB | 6.59 |
| Pyramid RM1000 | Oracle | 300GB | 1,535GB | 5.12 |
| NCR 5100 | Teradata | 300GB | 880GB | 2.93 |
| NCR 5100 | Teradata | 1,000GB | 3,280GB | 3.28 |