DBMS
 

 

March of the Data Marts

By Peter Brooks
DBMS, March 1997

If your enterprise dreads the sprawl of a data warehouse, consider the virtues of the data mart.


Data mart technology is changing -- growing beyond its roots of providing easy and fast access to one subject area of information for a small group of users. Organizations are realizing that enterprisewide data warehouses are complex to both build and use. To implement, data warehouse development requires significant staff, hardware, software, time, and funding. Conflicting organizational needs must be addressed. Data warehouses have been called data jailhouses, data basements, and data cemeteries because users find them difficult to understand and tough to navigate. Consequently, organizations are building data marts in place of or to complement enterprise data warehouses.

As the use of data marts has exploded, so too has the capability required of them. Rather than simply being small databases with easy end-user access, some data marts must now be scalable to consist of hundreds of gigabytes of data upon which sophisticated analyses such as data mining can be performed. Hundreds of users must be able to remotely access data marts -- a capability that Internet and Intranet technology can provide at low cost. Users are questioning why data mart installations require a large amount of custom setup. Finally, organizations need to be able to centrally administer and manage many data marts to prevent the proliferation of data marts that contain inconsistent and conflicting data -- just the problem that data warehousing was supposed to solve!

Data marts are growing to sizes that approach that of small-scale, enterprise-wide data warehouses. Traditionally, data marts were defined as small databases focusing on one subject or functional area while an enterprisewide data warehouse contained information from many different subject areas across the corporation. However, 100GB Unix and Windows NT decision-support databases are being built today. In some situations, this size of database includes all of a corporation's decision-support needs and is a data warehouse. In other situations, the same system would be a data mart extracted from a terabyte-sized data warehouse.

Although it is now difficult to differentiate data marts and data warehouses simply based on size, some distinctions are still important:

Vendors are developing the concept of a virtual data mart to satisfy the needs of users to access multiple data marts without requiring a lot of data replication between marts. Virtual data marts are views of several physical data marts or the enterprise data warehouse tailored to specific users or groups of users. A virtual sales data mart for a company's U.S. sale manager, for example, could consist of the U.S. sales data mart integrated with selected competitive information from the corporate data warehouse and selected summary sales information from the European and Asian territory sales data marts. To the U.S. sales manager, all data would appear to come from the U.S. sales data mart, even though data would actually reside in numerous data marts and data warehouses. (See Figure 1.) Security would allow access only to pertinent information.

New data mart technology is still being developed, although not as prolifically as several years ago when relational OLAP was new and numerous vendors came onto the data warehousing scene. Both Information Builders Inc. (IBI) and SAS Institute Inc. announced new databases in the last year: Focus Fusion and SAS MDDB, respectively.

The recent growth in data marts, however, has created several problems in providing users with access to an organization's information:

Data mart solutions require a two- or three-tier application database architecture -- the data warehouse being the optional first tier (if the data mart is an extract of a larger data warehouse), the data mart itself being the second tier, and the end-user workstation being the third tier. Data gets distributed among the three tiers; applications reside on the data mart and end-user workstation tiers. To support virtual data marts, Information Advantage Inc. supports multiple heterogeneous data mart servers and storing of metadata on a separate node independent of any particular data mart database.

The Importance of Performance

Data marts are growing in size -- even data mining technology is being used to process information in data marts. (See my article "Data Mining Today" in the February issue of DBMS, page 59.) Data mart performance involves the balancing of two critical components: end-user response time and data-load performance. A data mart designed for fast user response will have a large number of summary tables and aggregated values. Unfortunately, creating the summary tables and aggregates significantly increases the time of the data load procedures. Of course, hardware improvements and the use of parallel processing contribute to improved performance. However, this article focuses on performance improvements from perspectives more specific to a data mart database.

Red Brick Warehouse 5.0 from Red Brick Systems Inc. provides an example of significant performance improvements being made to an underlying data mart database engine. Features include Red Brick's Continually Adaptive Indexing TARGETindex capability, which provides indexes that automatically and continually adapt to the data being processed. A new hybrid hash join more efficiently handles situations such as subjoin recursion in very large joins, the joining of tables of significantly different sizes, and improved performance optimization of hash joins. To avoid having to run several SQL statements in a row to retrieve results for a complex query, SQL queries can be embedded in the FROM clause of another query. Like some other tools, initial rows of a query response can now be retrieved for end-user analysis while other result rows are being created.

Multidimensional databases (MDDBs) such as Arbor Software Corp.'s Essbase support incremental database updating so that the entire MDDB structure doesn't need to be changed for each update -- only data cells that are affected will get updated. This is an advantage because, unlike a relational table consisting of rows that can be individually updated, MDDB cubes traditionally have required the entire cube to be updated if any data in the cube changed -- a time-consuming process.

One promising yet underused area in data mart performance improvements involves decreasing the size of the data mart itself. Several vendors have created innovations in this area. Pilot Software Inc.'s Pilot Decision Support Suite now supports dynamic dimensions and hierarchies to significantly reduce server size and consolidation time. Dynamic dimensions allow aggregations to be calculated on demand rather than preaggregated and stored in the MDDB cube. This can significantly reduce the size of an MDDB cube as well as the consolidation time required to calculate MDDB aggregations. One MDDB reportedly decreased in size from 4GB to 200M using this technique.

CrossZ Software's data mart solution, QueryObject, can be viewed as either an MDDB or relational database using ODBC-compliant query tools. The company claims that fractal algorithms compress source data up to a 10,000-to-one ratio with 100-percent accuracy. End-user response time is still a primary consideration in all data mart designs, and no method guarantees satisfactory data mart performance. Although the approaches tried by various vendors improve data mart performance, there will always be tradeoffs in balancing end-user performance, precalculated vs. on-the-fly summarization, data load performance, and the size of the data mart.

Data Mart Administration

As the number of data marts in companies has grown, so has the need to centrally manage and coordinate data mart activities such as data mart versions, enterprisewide security authorization, and performance tuning. Without central management, data becomes inconsistent between departments, users can't access information in several data marts, and the data marts eventually become too disparate to be integrated into an enterprise data warehouse. Coordination and management of an entire enterprise's collection of data marts should have a centralized focus rather than requiring the owners of each data mart to learn and perform administration activities.

MicroStrategy Inc.'s DSS Administrator is designed to manage multiple decision-support system projects, user groups, and reports. (See Figure 2.) One powerful capability is the management of virtual data marts that enable users to access appropriate information throughout an enterprise from many different physical data marts. Users can be placed in groups for easy security administration. Application development can be managed by moving applications from a prototype to development to deployment data mart. The administrator can perform an overall system analysis as well as examine report generation response time and resource usage by time of day. System resource, report, and data usage is available by user. Bills and chargebacks can be generated for users of data marts based on usage, activity, and time of day. Because system metrics are stored in a database catalog, custom reports can also be generated.

One aspect of data mart administration is the ability to optimally tune the data mart. IBI's Site Analyzer enables an administrator to analyze user and resource statistics such as columns and tables accessed, performance, and resources consumed about each query. An administrator can use this information to set query thresholds. The IBI pre-emptive governor uses rules in a knowledgebase built from the system statistics and thresholds to estimate before a query is run whether it will exceed resource limits; then the governor accepts or rejects the request based on its estimate.

I'm relieved to see products that address the need for enterprise data mart administration. This is an area with growing requirements, such as the coordination of data mart and enterprisewide data extraction, load, and replication procedures, backup and recovery procedures, metadata management, security, and performance tuning.

Data Mart In a Box

Many vendors have recognized the need to make data marts easier to install and implement than an enterprise data warehouse. Although a data mart database is relatively easy to install, other issues arise, such as data sourcing, data loading, performance, and vendor software integration. The development of a "data mart in a box" is designed to minimize these problems by giving organizations one-stop shopping to build a data mart, typically within 90 days. Hardware, software, and/or professional services are included, although, interestingly, few offerings include all three components.

IBI's SmartMart employs its EDA middleware through source data acquisition tools to extract and process data from over 60 relational database or file sources into a data mart based on IBI's Fusion MDDB or major relational databases. IBI's Focus Six for Windows is the data access tool. In addition, WebFocus is the Internet end-user data access product. Warehouse administration tools and a metadata information catalog are included.

IBM's Visual Warehouse runs on an OS/2 or Windows NT server. The Windows NT version includes the Visual Warehouse server, ODBC drivers, a DDCS connectivity product, DataGuide metadata repository, and Lotus Approach for end-user analysis. All versions of DB2 and popular relational and nonrelational databases can be accessed.

Informatica Corp.'s PowerMart suite 3.5 contains Informatica PowerMart Designer, Repository, Server Manager, PowerMart Server, and Change/Capture components. Popular relational databases are supported. The Star Schema Design Wizard uses a visual interface to step a user through the database design. Once completed, the SQL data definition language is generated for the target database. The Change/Capture function accesses operational system log records to move incremental change records into the data mart. Changes can be captured as they occur or on a periodic basis.

NCR Corp.'s RightStart program is designed to deliver within 90 days a departmental data mart that can then grow into an enterprise data warehouse. The data mart includes a WorldMark 5100S server, the NCR Unix MP-RAS operating system, a database (Teradata, Oracle, or Informix), data access and transformation tools, and professional services. Separately, NCR has announced that it will port its Teradata database to Windows NT, provide high-speed asynchronous and synchronous replication, and improve interoperability between Teradata and Microsoft SQL Server databases.

Data marts in a box can provide a convenient and relatively low-cost source of data mart tools. Although these products can kick-start a data mart development project, issues such as operational data extraction logic, consistent data definitions, and designing the data mart for optimal performance must still be considered.

Internet/Intranet Promises

The promise of Internet/Intranet technology is to give users inexpensive access to the data warehouse and data marts to access corporate data marts with common (read: low-cost) Web browsers. Users can be employees or, more interestingly, customers, suppliers, or other corporate partners. MicroStrategy and Information Advantage provide ROLAP capability on the Web; Arbor Software and Pilot Software, among other MDDB vendors, each offer a Web product. (See "A New Face for OLAP" by Rich Carickhoff in the January issue of Internet Systems, page 24.)

Data mart Internet/Intranet access products generally consist of code that sits between a Web server and the vendor's OLAP product. Netscape Navigator and Microsoft Explorer browsers are supported. Leading data mart companies are beginning to include Java and/or ActiveX capabilities in their offerings, as compared to initial products that were constrained by HTML functionality.

The Windows NT-based Essbase Web Gateway provides OLAP functionality to users of Essbase. In addition to slicing and dicing, multiple users can write back to the database.

Pilot Software's Pilot Internet Publisher lets Pilot Desktop users access Pilot Decision Support Suite functionality via standard Web browsers. Pilot Software and Blue Isle Software Inc. have agreed to a partnership that will use Blue Isle's InTouch/2000 software to distribute Pilot personal cubes and reports to users via the Internet or major email systems. Data is compressed to five to 25 percent of its original size.

MicroStrategy's Windows NT-based DSS Web 4.1 features include a Java-based AutoPrompt facility to aid in building queries, international language support at the desktop level, and DBA-level diagnostics. The program supports Windows 3.1, Windows 95, Windows NT, OS/2, Macintosh, and Unix client operating systems.

Over time, data mart vendors should expand their Internet server support beyond Windows NT to Unix. Internet browser functionality must catch up to the functionality found in traditional client/server OLAP applications, or else there will be two classes of data mart access: basic slice-and-dice capability for Internet-based users, and more sophisticated capabilities for those with client/server access. Vendors may use Java and ActiveX to provide sophisticated functionality to Internet clients, although this could require getting away from the thin-client paradigm.

Although Internet browsers don't contain all of the access capabilities available in client/server applications, Internet access is a good solution for providing data mart access to users who are large in number or geographically dispersed -- especially users who don't need sophisticated interactive capabilities. These differences will disappear as Java and ActiveX technologies are exploited and users migrate to Windows 95.

What Should You Do?

Several fundamental approaches can give users access to decision-support data. One approach is to build an enterprise data warehouse that can be used directly by users and can also feed data marts. Other companies are building data marts planned for eventual integration into a data warehouse. Another technique becoming more popular is to build the infrastructure for an enterprise data warehouse while at the same time building one or more data marts to satisfy immediate business needs.

If you are investigating or building a data warehouse or data marts, then the benefits and costs of taking the time and resources to build an enterprisewide data warehouse must be balanced against the benefits and costs of getting a data mart up and running quickly to address specific business opportunities.

An enterprise data warehouse approach has several advantages: The requirements of all business functions can be included; data definitions and business rules are consistent; and data redundancy is minimized. Advantages to a data mart strategy include lower initial implementation costs and effort, as well as a shorter development timeframe; end users experience improved performance; and business functions can control their own data marts.

Although data marts can provide success in solving business problems, the proliferation of unplanned data marts throughout an enterprise can lead to data inconsistency, data duplication, and an inability of users to access all of the data they need. Vendors are addressing some of these ongoing challenges:






Figure 1.


--A virtual data mart lets a sales manager view information from several data marts and a data warehouse as if the data were all in one data mart.

Figure 2.


--MicroStrategy's DSS Administrator lets administrators create "virtual data marts" -- multiple, independent applications that access a segment of an enterprise data warehouse. (Courtesy of MicroStrategy Inc.)


Peter L. Brooks is a management consultant with the Advanced Technology Group of Coopers & Lybrand Consulting, where he specializes in leading organizations to achieve business value by applying strategic and innovative technologies such as data warehousing. Peter is based in Boston and can be reached at plbrooks@compuserve.com.


* Arbor Software Corp., Sunnyvale, CA; 408-744-9500 or fax 408-744-0400;
www.arborsoft.com.
* Blue Isle Software Inc., Lowell, MA; 888-258-4753 or 508-458-0017; www.blueisle.com.
* CrossZ Software, Mitchell Field, NY; 516-228-8500 or fax 516-228-8584; www.crossz.com.
* IBM, White Plains, NY; 1-800-426-3333; www.ibm.com.
* Informatica Corp., Menlo Park, CA; 415-462-8900 or fax 415-462-8910; www.informatica.com.
* Information Advantage Inc., Minneapolis, MN; 612-820-0702 or fax 612-820-0712; www.infoadvan.com.
* Information Builders Inc., New York, NY; 800-969-4636 or fax 212-967-6406; www.ibi.com.
* MicroStrategy Inc., Vienna, VA; 703-848-8600 or fax 703-848-8610; www.strategy.com.
* NCR Corp., Dayton, OH; 513-445-5000; www.ncr.com.
* Pilot Software Inc., Cambridge, MA; 617-374-9400 or fax 617-374-1110; www.pilotsw.com.
* Red Brick Systems Inc., Atlanta, GA; 800-777-2585, 408-399-3200, or fax 408-399-3277; www.redbrick.com.
* Sagent Technology Inc., Menlo Park, CA; 415-833-6800 or fax 415-833-6820; www.sagenttech.com.
* SAS Institute Inc., Cary, NC; 919-677-8000 or fax 919-677-8123; www.sas.com.




Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
March 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Tuesday, February 11, 1997.