DBMS, June 1998
DBMS Online: Server Side By Martin Rennhackkamp

Red Brick Warehouse 5.1

Relational data warehousing and data mining are reaching new heights.


Last year my colleagues and I designed and implemented a system for a telecommunications company. The system transforms and loads circuit fault data (or trouble tickets) every night from CA-OpenIngres operational databases to a CA-OpenIngres operational data store (ODS). From there the data is loaded weekly into an Oracle marketing data warehouse located on the other side of the country. Fault center managers use the ODS daily to determine customer satisfaction levels, adherence to service level agreements, and fault-handler performance statistics. Corporate decision makers use the marketing data warehouse for planning and long-term strategies. Some of them extract personal data marts from the marketing data warehouse. Although data warehouse experts advocate merging the ODS into the data warehouse (see Ralph Kimballıs "Data Warehouse Architect" column in the December 1997 edition of DBMS), this companyıs politics and operational procedures and the skills of the development team necessitated this architecture. However, the fact that this project is now in a second phase that caters to new data requirements proves that it works. The bottom line is that you can use a plain-vanilla relational DBMS normally used for OLTP successfully as a data warehouse server.

Red Brick Systems Inc. has challenged this notion by implementing several features in its Red Brick database server that are uniquely suited to decision-support processing. First released in 1990, Red Brick Warehouse is an RDBMS designed exclusively for data warehouse processing (in the broadest sense of the word). I have decided to review the latest release, Red Brick Warehouse 5.1, in this monthıs column.

A recent press release touts Red Brick as the worldıs fastest and most scalable relational database for data warehousing, data marts, OLAP, and data mining. I wrote the phrase "worldıs fastest and most scalable" off as blatant American-style advertising, but I will nonetheless investigate this claim later on. What interested me more was "relational database for ı OLAP and data mining." I knew from experience that you can use a relational database for data warehouses and data marts, but I have always had the impression that you need special functionality to perform data mining and that for OLAP you required a multidimensional data model to drill down, up, across, and around the various dimensions. For relational OLAP (ROLAP) you need to impose a multidimensional model on top of the relational database. How do you combine that in the same database with the star and snowflake data models used for data warehouses and data marts?

Relational Warehouse

Although I have successfully used an RDBMS for data warehousing, you cannot use any old RDBMS for large-scale, in-depth data warehousing. It must provide data loading and processing, advanced query processing, dimensional analysis, connectivity, data integrity control, and database administration, as well as VLDB, parallel, and concurrent user scalability, with good performance and efficient resource utilization to boot.

Where OLTP systems typically focus on the performance of repeated insert and update operations and a number of well-defined queries, the data warehouse must support ad hoc queries of any complexity. Data warehouse users issue different types of queries than OLTP users, and they require additional SQL functions for decision-support queries. A single query on a data warehouse can access thousands or millions of rows in a different fashion from a typical OLTP transaction. A data warehouse often has to support a large number of these queries concurrently. My view is that the types of queries supported by the DBMS depend greatly on its query optimizer, query processor, and in a way, by the extensibility of the DBMS where you can add user-defined functions. The number of concurrent users depends on the concurrency control mechanisms and the scalability of the DBMS.

Loading a data warehouse is complex and typically involves a variety of sources, including legacy, file-based, and relational systems. Conventional relational bulk-load and replication mechanisms cannot perform complex data transformations, nor can they load a data warehouse fast enough. The simple relational bulk loaders perform bulk loading as a series of steps, and they cannot perform much processing during the load.

Database administration and maintenance tasks can inhibit the availability of an online data warehouse. The problem is that routine database maintenance activities on a conventional relational OLTP database become prohibitive on a large-scale data warehouse database. Reindexing and query optimizer statistics sampling, which are periodically run against most relational databases, are not practical against a terabyte-sized data warehouse. A reindexing run isnıt useful if it locks the data warehouse tables for two days. Having juggled cron scripts to run efficiently one after the other in the dead of the night, I cannot stress enough that an RDBMS used for data warehousing should not require any physical reorganization.

There are additional requirements, such as space- and time-efficient backup and recovery mechanisms, but I donıt have room to cover all that here. It is important to keep in mind, however, that these facilities also play a role in the other tasks, such as data loading. If a bulk load process fails, recovery and restart must take place in a very short window ı among other reasons, to allow the reload process enough time. Traditional OLTP transaction log-based recovery facilities impose too many overheads for the large-scale loading and updating performed in data warehouse environments.

Red Brick Warehouse

The current release of Red Brick Warehouse consists of three components: a database server, a load subsystem, and gateway technologies for client/server access.

Red Brickıs relational database server was designed to support databases typically larger than 500GB with billions of records. It uses compact representations for numeric data and compressed indexes to reduce the storage space requirements. On the appropriate hardware platforms, it employs parallel scanning, parallel joining, and trademarked technologies it calls parallel-on-demand and parallel SuperScan. Using parallel-on-demand, the Red Brick query analyzer partitions queries for the optimal degree of parallelism, where it considers the queryıs complexity, the tablesı partitioning, and the available resources. For example, it allocates less parallel processing on a busy system than on an idle system. The query analyzer also uses a technique called dynamic incremental query optimization, where a queryıs execution plan is optimized as it is executed ı meaning that the results of the intermediate steps are used to determine the execution methods for the subsequent steps. In this way, it eliminates the performance problems sometimes introduced when a query optimizer uses out-of-date statistics.

Red Brick supports conventional B-tree, star, and target indexes for different types of queries. Star indexes are automatically built when tables are created ı they are used to accelerate join queries over primary and foreign key relationships. It is interesting to note that a star index contains entries for both the foreign key and the primary key tables, which I have not seen in any other DBMS. To overcome the performance problems of conventional sequential pairwise joins in star schemas, Red Brick uses a high-speed, single-pass, parallelizable, multitable join algorithm called StarJoin, which uses star indexes extensively.

Target indexes are low-maintenance adaptive bitmap indexes specially designed for dynamic large and wide tables with many repeating values. A target index can have one of three different subtypes, called domains, each optimized for the domain size of the indexed column. A small domain is a true bitmap designed for columns with two to 50 distinct values (such as gender: male vs. female). A medium domain is a compressed bitmap using a delta-encoding method designed for columns with 50 to 500 distinct values (such as country: U.S., China, U.K., France, India, and so on). A large domain is a noncompressed row ID (RID) list for columns with 500 to several thousand distinct values, such as ZIP code.

Red Brickıs "hybrid" index type can mix the domains for a column to neutralize the effects of a skewed data distribution. Every different column value (not just column) can have a different index representation. For example, for an organization in California, Red Brick will use a small domain uncompressed bitmap to index the "CA" values and a medium domain for all the other domain values. Red Brick uses target indexes in its new TargetJoin algorithms to apply sets of restrictions and bitmap StarJoin operations to multiple tables ı this can be applied to any type of join operation, not just star schemas. Traditional bitmap indexes are usually applied to restriction queries on a single table, but Red Brick Warehouse uses TargetJoins to evaluate a number of restrictions on multiple tables in parallel. It uses intersection operations on the resulting bitmap vectors to evaluate the restrictions on multiple tables together.

Red Brick has an extended SQL, called Red Brick Intelligent SQL (RISQL), specially developed for decision-support queries. It supports business analysis queries through sequential, rank, running total, moving average, and ratio functions. It has additional numeric and string functions and macros to simplify repetitive queries and calculations.

The high-performance load subsystem, called the Table Management Utility (TMU), was designed to populate data warehouses quickly and efficiently. It provides data aggregation, conversion, and transformation utilities, for example, to mask out or merge specific fields from an input data stream and roll up totals through data hierarchies. It performs these steps, checks the integrity, and updates all the necessary indexes in one integrated run. As part of this process, it can insert missing keys in lookup tables to satisfy the referential checks. It does as much processing in memory and data loading in parallel as possible. To reduce I/O, it uses block space allocation for index builds and table extensions.

Red Brick Data Mine Option

Instead of taking the data to a mining tool, Red Brick takes the data mining tool to the data. This is the result of a collaboration with DataMind Corp. whereby DataMindıs neural network, decision tree, and statistical algorithms have been incorporated into the core Red Brick RDBMS server. As a result, you can perform data mining directly on the Red Brick Warehouse database through the Red Brick Data Mine option.

Through the Red Brick Data Mine option, you can also create multidimensional models that appear as tables in the database. When you insert data into the data warehouse tables, all the data mining calculations are performed and the results are stored in the model tables, which can be analyzed or "mined" using RISQL. The model tables can be created using the GUI-based Red Brick Data Mine builder or through extended SQL statements issued directly to the warehouse DBMS server.

This approach to OLAP and data mining saves a considerable amount of data extraction, transformation, shipping, and loading, not to mention the redundant storage of similar information in different databases. It also saves on administrative procedures, where security and user administration tasks are only performed once on the data warehouse database. Users can use the same front-end tool for interrogating the data warehouse data and for mining and analyzing the data. Most of these tools can generate SQL queries, which, as I already mentioned, are executed efficiently by the Red Brick server. The biggest advantage of this approach is that when analysts want to perform drill-down analysis, the detailed data is at hand in the same database, and the mappings between the OLAP data and the warehouse data are defined in the metadata.

Release 5.1

The new features in Red Brick Warehouse 5.1 aim to improve ease of use and administration. These include the trademarked Red Brick Vista and Red Brick SQL-BackTrack technologies, TargetJoins, the Red Brick Warehouse Administrator, Integrated Enterprise Control and Reporting, and various query and index optimizations.

Red Brick Vista, a component of the Red Brick server, in turn consists of three additional components for aggregate life cycle management ı that is, to manage and process aggregate queries. It comprises an Aggregate Advisor, a Transparent Query Rewrite function, and a metadata layer. The Aggregate Advisor utility can audit selected aggregates and compare their estimated gains to actual system usage. The Aggregate Advisor uses a query log to generate a comprehensive cost lattice ı a structure that defines the interrelationships and dependencies between various aggregates to calculate a cost formula for each potential aggregate. The compiled results let DBAs choose which aggregates to create. The Transparent Query Rewrite function transforms SQL queries to use the aggregates. It analyzes complex SQL queries including queries with subqueries; UNION, EXCEPT, and INTERSECT queries; and the SELECT part of INSERT SELECT statements. It automatically transforms the queries to use the stored aggregates. This is a useful feature; some query and reporting packages generate the most cumbersome SQL statements. An administrator can edit the aggregation strategies without affecting existing applications and queries. The metadata layer contains the definitions of the aggregates and the tables on which they are based. The Transparent Query Rewrite facility, the Red Brick parallel loader (TMU, which loads aggregates automatically when a base table is updated), and the Aggregate Advisor uses the metadata layer. The metadata layer is also where you store the definitions of dimensional hierarchies ı that is, successive levels of aggregation (such as time dimension of day, month, year) used in drill-down and roll-up operations.

The old Backup/Restore option has been replaced with SQL-BackTrack for Red Brick Warehouse, the result of a joint development effort between Red Brick and BMC Software. SQL-BackTrack for Red Brick Warehouse is an administrative utility to manage fast backup and recovery. It supports online, incremental, and parallel backups. With online backup, warehouse users can continue working while backups are performed. Incremental backups only back up changed data. With parallel backups, the warehouse database can be backed up to multiple devices simultaneously. You can use data compression to reduce backup times even more and save storage space. The recovery time is fast because SQL-BackTrack restores full and incremental backups in the appropriate order to get the database users back to work with minimum downtime. SQL-BackTrackıs administrative utility lets you preconfigure backups, which makes it possible to run backups unattended from a single command. SQL-BackTrack automatically records information about backups, including time and date, backup type, backup media, and database allocation. Similarly, it guides the user through the recovery process. The Open Backup Stream Interface (OBSI) available from BMC lets administrators send physical backups and exports directly to tape, magnetic or optical disk, or third-party storage management systems. SQL BackTrack has a dry-run recovery verification feature, which lets you check the validity of a backup before it is restored.

Red Brick Warehouse Administrator is a GUI-based administrative tool, available on Windows NT and Windows 95, used to control all the data warehousing tasks on all the platforms that Red Brick Warehouse supports. It is especially focused on segmentation and partitioning. It contains wizards, online context-sensitive help, and what-if analyses to help DBAs through schema changes. (See Figure 1.) The Enterprise Control and Coordination (ECC), Copy Management, and RISQL Reporter modules previously available as additional modules are now included in the base product.

Red Brick Warehouse 5.1 contains enhancements to query and index optimizations, mostly to improve the performance of count(), min(), max(), aggregation, and group by functions, without any administrator or user intervention, such as hints included in the code.

Red Brick Warehouse 5.1 supports ANSI SQL-92 standard temporary tables for work areas and intermediate results. These tables, which are only visible in the session in which they are created, behave like normal tables but do not incur overhead on the system catalogs, nor can they have any constraints. Increasingly, third-party tools require these temporary tables for complex OLAP and multipass SQL operations.

Limitations

All queries submitted to a Red Brick Warehouse database are subject to a limit of 8K on the row size of the intermediate and final result tables. This limit is the maximum size of a row in a table. It is important to consider this limit, not only when designing the warehouse data model, but also when considering all the possible queries the users may want to make. If they join many large descriptive columns from the dimension tables to a wide fact table, they could exceed this limit.

The Red Brick Warehouse database server has a default stack size of 5MB. The server will fail if it runs out of stack space, for example, while processing a large query that involves a large IN list. For extensive data mining operations, this may become problematic.

The SQL UPDATE, TMU LOAD UPDATE, and TMU LOAD MODIFY operations can fail on some rows. This happens when the table contains more than one segment, is segmented by ranges or like data, the segmenting column is updated, and there is a star index that references the table. The update operation would cause the row to move to another segment. If you need to update rows in this scenario, you have to drop the star index, perform the update operation, and then re-create the star index.

You cannot interrupt an optimized load after it has started the merge phase of the index-building process. If you force an interrupt after the merge phase has begun, you might invalidate the indexes being built. You should consider this when scheduling large data loads.

When the TMU is used to load more than 232 (more than four billion) rows into a table, a warning message indicates that the RBW_LOADINFO system table will record incorrect row counts beyond that point. However, the TMU will continue to load the remaining rows and no user intervention is required.

If you are using Red Brick Vista, SELECT DISTINCT queries are converted to equivalent group by queries so they can be rewritten to use precomputed views. However, this conversion does not take place when the select distinct operation is part of a subquery. This problem will be corrected in a future release.

You may get a nasty shock when updating column values. Values in a NOT NULL column should not be updated with NULL values ı it should return an error message. However, the current behavior is that the UPDATE statement executes without either updating the NOT NULL rows or warning the user that the update is prohibited. RedBrick insists that this problem will be corrected in a future release.

When a query accesses one or more offline aggregate table segments, and PARTIAL AVAILABILITY configuration parameter is set to the default PRECHECK mode, the query will fail to execute. In a future release, only those precomputed views whose aggregate tables do not have any offline segments will be considered for rewritten queries.

Data Warehousing Made Easier

RedBrickıs table and index representations, together with its query execution strategies, are aimed at carrying out any data warehouse queries of any complexity as fast as possible on very large data warehouses. You can also see that they cater to large data warehouses in their loading, administration, and backup and recovery facilities, all of which promote segmentation and a high degree of parallelism.

The OLAP functionality built into the relational database server is a very interesting feature. This approach is quite unique, as most other vendors view OLAP as a specialized subject area, requiring a specialized multidimensional server. This approach of taking the OLAP functions to the data, instead of taking the data to the OLAP server, saves a considerable amount of data duplication and unload/reload processing. It also maintains the links between the multidimensional views, and the warehouse data, which is extremely useful for drill-down operations.

Our client is already running a number of RDBMSs in various splinter groups throughout the organization, against its "standardization policy." Would it have approved the acquisition of yet another RDBMS ı at what seemed to me quite a stiff price? Iım not too sure. However, it would have made our life a lot easier to implement and manage our ODS and data warehouse application using such an RDBMS specially designed for the task at hand.



Figure 1. Red Brick Warehouse Administrator contains wizards, online context-sensitive help, and what-if analyses to help DBAs through schema changes.


Martin Rennhackkamp is the owner and principal consultant of The Data Base Approach, a corporation specializing in relational and distributed databases, based in Cape Town, South Africa. You can reach Martin via the Internet at mr@dba.co.za.
What did you think of this article? Send a letter to the editor.


Subscribe to DBMS -- It's free for qualified readers in the United States
June 1998 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated May 6, 1998