DBMS
 

 

Taming Data Giants

By Stephen Brobst and Owen Robertson
DBMS, February 1997

Getting a Grip on VLDB in the Real World.


Very large database implementations continue to capture the attention of the industry by virtue of their inherent technical challenges and complexity. One common question is, "How big does a database have to be before it is considered a VLDB implementation?" Big, however, is an ambiguous word. Big in terms of what? The first metric that usually comes to mind is to define the "bigness" of a database in terms of its size. Oracle is increasing its focus on the VLDB marketplace, and other databases, such as Teradata, boast reference sites in excess of a terabyte in database size.

You must be quite careful when assessing the database size metric across VLDB implementations. What you include in the database size can be critical in distinguishing among VLDB implementations. Typically there is a huge difference between raw data size, total database size, total usable disk-space size, and total disk-space size (including media protection, such as mirroring or RAID5). (Table 1 lists the commonly used data-storage measurement terms and their approximate size.) For example, at one VLDB site that Strategic Technologies & Systems recently worked with in its DB2/6000 Parallel Edition implementation, a 300GB raw data requirement resulted in a total database size of approximately 950GB once we considered the indices, multidimensional summary tables, result set tables, relational database overhead, sort space, and log space. In addition, working space in excess of 150GB was required for staging, sorting, transforming, and integrating data from multiple mainframe sources prior to loading into the data warehouse. This file-system space was recycled for use by many concurrent SAS analysts for SAS dataset creation when data preparation was not taking place. By the time we performed mirroring for high-availability purposes, the total disk size of the data warehouse platform was well in excess of two terabytes.

For VLDB purposes, I will focus on the database portion of such implementations (excluding working space for staging, external data sets, and so on). Even within this scope, however, the ratio of raw data size to total database size can vary widely across database products and application implementations. Raw data to disk space ratios in recently published TPC-D results indicate that you can expect factors ranging from three to six in raw data size to total database size for most high-end database products targeted for VLDB warehouse implementations. The TPC-D benchmark provides a well-defined set of query and data maintenance functions designed to model a complex decision-support environment. Even when the application is held constant (TPC-D has predefined application characteristics), the implementation of a workload with varying database products can have a large impact on total database size. (See Table 2.) In many cases, the total disk size may be in excess of the storage requirements just to get enough spindles into the configuration to yield desirable performance.

Size alone does not constitute a thoroughly remarkable VLDB implementation. After all, anyone can shove billions of rows into a database. What you do with the data is what really counts. In sizing up the horsepower requirements for a VLDB implementation, you must consider more than just size. My experience consistently points toward three primary factors for distinguishing VLDB implementations: database size, number of concurrently executing transactions or queries, and complexity of transaction or query workload. Because many decision-support systems employ full table scans for significant parts of their workloads, database size is relevant for measuring the workload because scan time relates directly to the size of the tables being scanned. The number of concurrently executing queries in a DSS tends to be one to two times lower than the concurrently executing transactions in a similarly high-end OLTP system. Because OLTP systems typically use selective, indexed access into the database, size is not nearly as relevant a measure of OLTP workload as the number of concurrently executing transactions. Complexity of a transaction or query workload is a huge factor in sizing up the horsepower for either type of system.

In terms of raw data size, there are still only a few database implementations in production in the terabytes range. According to the 1996 VLDB Survey conducted by Winter Corp. (Boston, Mass., see www.wintercorp.com), only the top three OLTP databases and the top four DSS databases had a terabyte or more of raw data in a single database image. Of course, I'm sure that other databases above one terabyte did not respond to the survey for reasons of competitive or defense-related secrecy. However, the landscape of VLDB implementations clearly thins above the one-terabyte level. Wal-Mart continues to lead the pack with its 2TB (raw data) Teradata implementation on the NCR 5100 (containing over 20 billion rows in its largest single table). Several databases in both the OLTP and DSS space have approximately five billion rows (typically in retail, telecommunication, and transportation industries). The number of concurrently executing transactions sustained during peak workload periods in a high-end OLTP system is typically somewhere in the (many) hundreds. These transactions are relatively complex compared to TPC-C transaction definitions. For simple transactions (characterized by the now-defunct TPC-B benchmark), it would not be unusual to see thousands of transactions per second executing on a high-end configuration.

Decision Support

The decision-support databases in most organizations outsize the operational databases by a substantial margin in terms of raw bytes of data storage. This phenomenon occurs for a variety of reasons. First, the operational systems need only relatively recent data to service customers and run today's business. Typically, a purge takes place every 12 to 18 months in the operational systems to eliminate data that is no longer necessary to run the business. The decision-support databases in most organizations maintain data that has much more historical depth than the operational systems. At least two years of history is needed in most organizations to understand trends in the business and facilitate strategic decision-making.

Decision-support databases also must be able to re-create point-in-time snapshots of the business in order to perform as-of analysis for purposes ranging from retrospective pricing analysis to construction of direct-mail response models and credit scorecards. The ability to create point-in-time snapshots requires date segmentation for all major data elements that are volatile over time. The operational systems usually deal with today's view of the data and therefore are not forced to encounter the complexities and storage overhead of date segmentation. Finally, sophisticated decision-support analysis will often require data from sources that are external to the operational systems in order to capture competitive information, market conditions, demographic information about customers, prospect lists with demographics, and so on. The combination of greater historical depth, point-in-time snapshot requirements, and integration of external data will typically drive the size of a decision-support database to a factor of two or three times the size of the operational databases for supporting an organization's data-warehouse capabilities.

The complexity of transactions and queries executed against very large databases varies significantly from site to site and even within a single site. For example, in the OLTP implementation of a VLDB at a large package-shipment company, a simple query may involve looking up a single record to describe the current status and location of a package via selective index access using a shipment number. Within the same system, a concurrently executing transaction may be retrieving all of the package records on inbound trucks to a particular shipping hub, as well as the customer information for those packages, to feed into a complex scheduling algorithm to determine whether or not an outbound truck should be released to its next destination. All transactions must be executed with sub-second response times to maintain workforce productivity and customer satisfaction.

Decision-support queries have a similar wide variance in complexity. Most complex DSS workloads do not consist of single-query executions. Typical applications in a data warehouse environment might involve thousands of lines of SQL to calculate profitability to the customer level using activity-based costing and revenue metrics or to produce at-risk scores for targeting customers for retention programs. It would not be at all unusual to observe as many as 100 SQL statements in such implementations. Each SQL statement selects large portions of the database to use in joins and aggregates across potentially billions of rows, usually calculating metrics from transactional behavior up to the customer level for tens of millions of customers. Most DSS usage will create several intermediate tables in the course of cutting and slicing data iteratively to arrive at the desired answer to an end-user query. One of the largest fallacies in the industry is that DSS queries involve only small answer sets resulting from complex joins. Yes, the joins are very complex, but answer sets are frequently millions of rows in size.

Efficiency in creating large intermediate result sets and extracting these result sets to analytic tools or target lists is one of the biggest shortcomings in many of the parallel databases on the marketplace today. Certainly many end-user questions can be answered with a single SQL statement, but the vast majority of DSS cycles are spent on applications with many SQL statements operating on successive refinements of large data sets through a series of filters and aggregates with other data in a data-warehouse database.

Industry Implementations

The ability to collect data and the need to leverage information for competitive advantage is increasing daily with technical advances. Retail data warehouses traditionally store information at the product-movement level (by SKU or UPC, date, and how many units and dollars were sold). Wal-Mart's famous 2+ terabyte database stores its information in this form. This granularity of storage, however, is not sufficient to support current trends in retailing directed toward better understanding of consumer behavior. According to Randy Mott, CIO of Wal-Mart, the world's largest data warehouse expects to quintuple its database size over the next few years to capture information at the individual market basket level. Rather than merely tracking how much product movement takes place in each store on each day, the next level of analysis in retail is to understand which products sell well with which other products to help plan promotions, store layout, profitability management, and inventory planning. The detail necessary to facilitate this type of analysis requires an unbundling of <SKU, store, date> summaries to allow access to which items were sold in the same market basket.

Capturing customer identity at the point of purchase gives retailers even more analysis capabilities. The inevitability of electronic purchasing as the preferred vehicle at most retailers (through use of smart cards and other devices) facilitates customer identification as a direct outcome of this trend. Additional sourcing of customer demographics and preferences gives retailers a powerful marketing and customer-service tool. Blockbuster Entertainment already identifies its customers at the point of purchase in its video stores and is looking forward to doing the same in its music outlets. Although the volume of data associated to all customers, purchases, and line items in a large retail chain may seem overwhelming in terms of the multi-terabyte database requirements, it is the intelligent use of this information that lets the likes of Wal-Mart and Blockbuster expand domestically and internationally.

Retail is not the only industry experiencing a trend toward increasingly large VLDB implementations. The emphasis in the marketplace on superior customer service has implications across the board. For example, in package shipping (an industry where many of the largest databases exist today), the granularity at which tracking is now being performed lets customers see exactly where a package is at any point in time during its transit. When a shipment consists of multiple packages, tracking is performed at both the shipment and individual package level to ensure accuracy in understanding when packages within a shipment get separated or misrouted. These package tracking databases were already among the largest in the world when they contained one record for each package along with its current shipping status. Now, consider the movement toward retaining scan-level data records for each time a package changes hands in the shipping process. A package may experience a dozen or more scans between its origination and destination points. The largest databases in the world are on their way to being 12 times larger. Capture of the scan-level data records lets you manage each step of service levels much more proactively in a business that ships millions upon millions of packages per day.

The Advent of Automation

The movement toward automation of customer interactions across most industries also increases the volume of data collected in what are already VLDB environments. Many of the leading financial services and insurance industry participants are introducing online kiosks that let customers browse product offerings and explore dimensions of interest to identify the appropriate suite of products and services for their needs. Nationwide Insurance, one of the largest insurance companies in the United States, is piloting kiosks in its offices now and has high hopes for the information that will be available through this technology. By capturing keystrokes within the navigation system of the browsing program, it is possible to identify exactly which information has been accessed and the sequence in which the customer examined the information. Understanding which aspects of the product and service information a customer demonstrates interest in, and the order in which that information is examined, provides a wealth of knowledge about customers who use the kiosks.

The deployment of automated kiosks represents only a small preview of what will be possible with large-scale adoption of Internet technology for accessing product information. Today, most Web sites on the Internet track the identity of the (potential) customer accessing the site. Web server logs record each page that a user accesses. Sophisticated log-analysis programs can track the user's movement within a Web site.

The ability to track information at a granular level regarding customer interactions with a well-designed Web site presents unprecedented opportunities for building up customer understanding on an individual basis, because the penetration of Internet "shopping" is predicted to grow exponentially over the next few years. The implications of tracking interactions with Web sites for consumer access is analogous to capturing a database of all the products that a customer examines in a store -- even the ones that aren't purchased. Moreover, in a well-designed Web site, it should be possible to infer what dimensions of the product were of interest and what products were compared before a purchase decision was made. Of course, this makes for a very large database to capture and analyze all of the information that is acquired under this new paradigm.

Dan Jaye, CTO at CMG Direct Interactive (based in Andover, Mass.), warns that you must be selective in what to capture from the Internet to avoid being overwhelmed by data (as opposed to information). A typical retailer would stock 100,000 SKUs in a large store; Jaye's team has over 70 million Web pages cataloged for tracking. Issues such as time relevance and statistical significance in analyzing sparse patterns of usage become critical when mining data of such proportion. CMG Direct Interactive currently has 1.6 terabytes of DASD on the floor managed by an Informix XPS database for CMG's initial prototype, and the company will be scaling up to a multi-terabyte implementation for the production deployment. The design point for CMG's system is to handle data accumulation at upwards of 100,000 Internet transactions per day.

Another significant factor in the growth of very large databases is the onslaught of object database technology as a means of expanding the types of information that can be effectively accessed and stored within the domain of a VLDB implementation. The merger of Illustra and Informix in the Universal Server product, as well as object features that will be made available in Oracle 8, clearly set the groundwork for integration of multimedia objects with data stored in a traditional relational framework. NCR Corp.'s Teradata database has also been extended with multimedia object capability in the research prototype delivered as part of the Moonbase project in cooperation with the National Institute of Health. NCR has announced that multimedia object capability and content-search functionality resulting from this project will become available in the Teradata database sometime in 1997, and a number of beta sites are currently being identified.

VLDB and Healthcare

The healthcare industry is likely to be one of the largest beneficiaries of the impending merger of relational and object technologies. The nirvana in healthcare is to provide an online version of an individual's complete medical history. Structured data, as well as X-ray images, physician (free text) notes, and a variety of unstructured data, could be made instantly accessible to the appropriate healthcare providers -- regardless of the geography in which treatment is to be administered. Sullivan McConnell, VP at Aetna Health Plans, envisions a future in which there will be a need to provide access to all of the relevant information about an individual to any appropriate provider within the healthcare delivery system -- regardless of when and where such services are rendered. Realization of this vision requires that all information about a patient's medical treatments, lab results, drug prescriptions, adverse reactions, links to hereditary medical histories, and much more will need to be stored in online databases with instantaneous access around the world.

Many challenges on the road to online medicine make this a difficult vision to bring to fruition. Standardization across healthcare providers for critical information is much less than ideal; even the integration of medical, mental health, pharmacy, and other data is a big undertaking for many health care providers because of the silo systems that exist in most large organizations. Make no mistake, however: The big push is on, and healthcare is likely to emerge as an industry that will drive some of the largest databases in the world.

VLDB and Manufacturing

The manufacturing industry is also likely to be a sleeping giant as far as VLDB implementations are concerned. There is much activity in the realm of data collection around manufacturing processes to monitor and provide quality control in this environment. More and more, historical data is being used to identify patterns in manufacturing failures related to specific processes, parts suppliers, and so on. It is not uncommon for the quality control in the manufacturing of certain types of goods to involve hundreds of tests, each giving critical information about the adherence to specifications of a final product or subassembly. Object-based database technology is often used to capture the complex relationships that exist among parts, subassemblies, and final products. For critical parts or subassemblies, it is often the case that X-ray images may even be taken and stored as part of the quality-control process. For large-scale manufacturers -- particularly those that make use of X-ray imaging in their quality-control processes -- VLDB implementations with object technology are already on their planning boards.

Money Matters

As disk prices continue to plummet, the economic implications of a VLDB implementation become increasingly attractive. The return on investment for incremental knowledge about increasingly competitive business environments or enhanced customer-service capability is becoming more and more attractive, especially when the cost of storing data starts to drop to mere pennies on the megabyte. Today, VLDB implementations are using storage technology that is already well below 50 cents per megabyte. Richard Winter, president of Winter Corp., predicts that disk prices are likely to drop below a nickel for a megabyte of storage by the year 2000. This price drop will surely increase the demand for database software that is capable of effectively managing tens of terabytes of data. Yet in 1996 only an elite few have tackled the thorny issues related to managing databases of this magnitude.

The issues of performance management, backup and recovery, and high availability take on new meanings when databases reach VLDB status. The importance of indexing strategies and database optimizer capability are two key areas that become critical when focusing on performance management for high-end applications systems. I will examine both areas in the next installment of this article, which will appear in an upcoming issue of DBMS. I will also explore parallel processing and three-tier architectures as they relate to VLDB technology. With all of the innovation now taking place at the high end of the marketplace, 1997 is rapidly evolving to be a landmark year for the introduction of VLDB enabling technologies.


Stephen Brobst is CTO at Tanning Technology Corp. (Denver, Colo.) and founder of Strategic Technologies & Systems (Boston, Mass.). You can email Stephen at sabrobst@tanning.com.

Owen Robertson is a senior DBA at Tanning Technology Corp. You can email Owen at oroberts@tanning.com.





TABLE 1. Data-storage measurement terms and their approximate size.

TermApproximate Size
byte8 bits
kilobyte1,000 bytes
megabyte1 million bytes
gigabyte1 billion bytes or 1,000 megabytes
terabyte1 trillion bytes or 1,000 gigabytes
petabyte1 quadrillion bytes or 1,000 terabytes




TABLE 2. TPC-D Chart.

Hardware DatabaseRaw DataTotal DiskRatio
Digital 8400Oracle100GB361GB3.61
HP 9000Oracle100GB643GB6.43
IBM SP2DB2/6000 PE100GB377GB3.77
NCR 5100Teradata100GB880GB8.80
Sun UEOracle100GB594GB5.94
Tandem K10000Non-stop SQL100GB286GB2.86
IBM SP2Oracle300GB1,977GB6.59
Pyramid RM1000Oracle300GB1,535GB5.12
NCR 5100Teradata300GB880GB2.93
NCR 5100Teradata1,000GB3,280GB3.28


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
February 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, January 22, 1997.