A Snapshot View of Where the Data Warehouse Market is Now and Where it's Going.
Despite the immense popularity of data warehouses, the truth is that in a perfect world they wouldn't exist. We wouldn't need them. In such a world, all of our operational systems would have consistent data. We wouldn't have a point-of-sale system storing data as "Male" and "Female" while the inventory system stores that information as "M" and "F." Nor would we have our general ledger application storing "26-Nov-1995" while our American inventory system stores "11/26/95" and its European counterpart stores "26/11/95." And we wouldn't be bothered with such annoyances as having "delivery date" mean the promised delivery date in our order-entry system while it means actual delivery date in our package tracking application.
Additionally, our operational systems would all have enough performance capacity to let users perform large, ad hoc queries directly against the data without destroying the performance for the simultaneously occurring OLTP transactions. In a perfect world, this would be possible because all systems would be infinitely fast. And cheap, too.
Those of you waiting for such a world will be waiting a long, long time. Data warehouses exist because none of the above is true. In the real world, data warehouses solve a very real problem. Because they solve a very real problem, they are becoming extremely popular as the basis for sophisticated decision-support and strategic analysis applications in all types of companies, across all industries. And as their popularity grows exponentially, the market's characteristics change rapidly.
Countless words have been written describing the specific data warehouse products available today, comparing the features of each. My goal for this article is different. Rather than focus on individual products, I take a snapshot of the data warehouse market to discuss where it is now, highlight important issues, and identify trends to define where the market is likely to go.
The first step is to understand where the market is today. Probably the best way to determine the current status of a market is to go talk to that market. For the data warehousing market, survey after survey of end users all tell the same tale. Data warehousing is big. And growing. A February, 1996 survey by the Meta Group (Stamford, Conn.) found that 95 percent of companies surveyed intend to build a data warehouse. Granted, many of these companies have unique ideas about what constitutes a data warehouse, but what's most interesting is that in 1994, that number was only 15 percent.
Various estimates (including those of the Meta Group) put the size of the data warehouse market (which includes all hardware, database software, and access tools) at $2 billion. This number will grow to $8 billion by 1998, which translates into a 60 percent average annual growth rate. Additionally, this $8 billion will be complemented by another $5 billion in integration services.
Not only is the market growing, but individual data warehouse applications are also growing. Again drawing on the recent Meta Group survey, 60 percent of those surveyed expected their data warehouses to grow beyond 50GB by mid-1996, and at the same time they will need to support more than 50 users.
Still, those pondering building a data warehouse should exercise a healthy dose of caution - less than 15 percent are currently in full production. There are many reasons for this. One reason is simply that the exponential growth rate of new data warehouse projects implies that the majority of these projects have only recently started. However, another set of reasons is a little more troubling. Corporations are learning that large data warehouse projects cost more dollars and take more time than originally thought. On average, when implementing a data warehouse project corporations spend $2 million on hardware and $1 million on software and services, and the project takes 12 to 18 months.
Although data warehousing is a highly horizontal application relevant to all industries, certain industries are farther along in their deployment of these applications than are others. For example, the retail industry is completely overrun with data warehouse applications. Because it's such a nice fit for that industry, multidimensional modeling and the star schema have become the design of choice for these retail systems. This design has one large main "fact" table that holds information on such things as product shipments and product revenue, and includes foreign keys in smaller "dimension" tables that hold information about the various products, sales regions, sales people, and so on.
Leveraging the successes of the retail industry, other "data rich" industries such as telecommunications, transportation, health care, and finance dove head first into data warehousing. These industries have relied heavily on the design techniques that became prominent in retail, most notably the use of the star schema. In fact, many organizations across all industries are under the impression that data warehousing means that a star schema must be involved. But this is a mistaken impression; it confuses the conceptual notion of a data warehouse with a particular physical implementation. Some organizations are discovering that dimensional modeling and the star schema don't suit their particular needs very well. These organizations are now experimenting with more traditional relational design techniques such as logical data modeling.
This shouldn't come as a surprise to anyone. The pundits who wave the star schema banner certainly have provided the world with a valuable approach, but it is not nirvana for everyone. Different organizations have different structures. It would be wonderful if a single schema architecture were optimal for everyone, but it's not possible. Don't be afraid to explore alternatives to the star schema if you find your organization working very hard to force-fit your business into a dimensional model.
Over the past few years, vendors certainly can't be criticized as having ignored the data warehousing market. The prodigious efforts of their product developers have been outshined only by the efforts of their marketing departments. There has been, and will continue to be, a frenzy of new product releases and acquisitions as each vendor tries to provide the best solution for data warehousing.
When the concept of data warehousing first became popular, the early adopters were those companies with large amounts of resources to expend on having the latest technological solutions. However, we're well past this early adopter phase, and the majority of companies don't have the previously stated $3 million to spend, nor can they afford to wait 12 to 18 months.
To address the needs of their new user base, over the past year vendors expanded the focus of their data warehouse offerings. Rather than simply flooding the already confused market with more products, vendors took a step back and looked at what has become possibly the most critical component to selling their solutions: services. Now they all provide services that can help users build small systems on a small budget but which still provide real value in a short time frame, thereby allowing users to ask management for further resources to expand the project.
These "Quick Start" (or Fast Start, or Smart Start, or...) programs usually bundle hardware, software, and services. They are typically low-risk, fixed-price, fixed-time projects that last anywhere from 15 to 90 days, and anyone contemplating building a data warehouse should take advantage of these programs.
But what good is it to build a small proof-of-concept system if all you do is throw it away? Sure, you have at least demonstrated the potential value, but still no one wants to try to convince management that they'll now have to wait many months while you throw away the prototype and start to work on the real data warehouse. Wouldn't it be better to take what you've built and scale it up as your needs grow and as resources (slowly) become available?
Even if you decide to start with a relatively large data warehouse (in the 100GB range), you are by no means immune to the need to scale up your warehouse. Any self-respecting data warehouse grows in multiple directions simultaneously. When it comes to data warehouses, data begets data, and usage begets usage. As end users begin to see the value of the system, they realize how adding incremental amounts of data can increase the usefulness of the system. Similarly, word of mouth spreads about the warehouse, and the number of users who want access to the system can soar. Because of this, a conservative rule of thumb is that the amount of data in your warehouse will double in a year. Similarly, count on the number of users growing by an order of magnitude within two years, during which time the overall CPU cycles required to process all of the queries will grow at least 100 times. A system that can handle this growth will be viable for a longer period of time, thereby increasing your critically important return on investment (ROI).
To build a scalable data warehouse, corporations are rapidly turning to scalable/parallel hardware and database platforms as the new foundation of choice. Conventional systems, though they can sometimes handle large amounts of data, usually cannot meet these scalability requirements because their processing power is limited to a single CPU and their I/O capabilities have fixed limits. However, scalable/parallel systems, which include both hardware and database software components, address these issues directly by letting users add multiple processors, additional disks, and additional I/O bandwidth to the system.
Scalable hardware platforms fall into three categories: symmetrical multiprocessors (SMPs), clusters, and massively parallel processors (MPPs). SMPs have multiple processors (usually up to a few dozen) that all share the same physical memory and the same memory bus. However, after a few dozen processors, this single memory bus becomes a bottleneck, so adding additional processors will not increase performance. Clusters were designed to address this problem. Once an SMP has reached its scalability limit, the concept of clustering lets users connect multiple SMP machines (referred to as "nodes") together with a very fast interconnect that allows the machines to work together as if they were a single larger machine. However, after clustering a few machines together, this single interconnect also eventually becomes a bottleneck. MPPs solve this problem by introducing the concept of a scalable interconnect, meaning that the bandwidth of this interconnect grows each time an additional node is added to the system.
Scalable database software is the second critical component of a scalable data warehouse. The database must be able to take advantage of the multiple processors and multiple nodes within a scalable system. Fortunately, there's no lack of parallel databases. All of the major vendors (including Oracle, Informix, IBM, Sybase, Teradata, and Tandem) have been developing and selling versions of their databases that leverage the underlying scalable hardware.
Once a corporation decides to use scalable technologies for its warehouse, a question arises: Which hardware and software platforms are right for its needs? Regarding hardware, it is usually desirable to use the most mature technology to solve a problem, because it will be the most stable. However, the more mature hardware technologies are the least scalable. Therefore, the general guideline is to find the most mature technology that will still meet your scalability needs. For example, if you need very little scalability, then maybe you don't need scalable hardware or software at all, and you can build your solution using conventional, very mature mainframe technology. However, few data warehouses are static, so if you will need vast amounts of processing power and scalability, then first look at SMPs. If they meet your needs, then use that platform. If not, look at clusters. If even clusters can't meet your needs, then look at MPPs, which give you maximum performance and scalability, but which are the least mature of the three technologies.
Choosing a scalable/parallel database requires a different approach. Don't spend all of your energy analyzing the architectures of the various vendor databases (for example, "shared nothing" vs. "shared disk" database architectures). Currently, it is not the architecture that is the most important thing to consider. Rather, it is the actual implementation that's more important. That is, did the vendor's developers do a good job of building the RDBMS features that your application requires? The only way to discover this is to test your own application. It is highly advisable to build prototypes and create and run benchmarks as a proof-of-concept for various facets of your data warehouse application. (Often, the vendor's "Quick Start" programs will help you do exactly that.) This process takes some time and money, but will ultimately save you from making painful mistakes.
Once your data warehouse is built, all you have is an environment that is potentially very valuable. Until you actually get your data back out of your system in some meaningful way, that potential remains untapped. Three classes of end-user access tools have emerged that can help you unlock the potential value of your warehouse: query and reporting tools, OLAP tools, and automatic data mining tools.
Query and Reporting Tools. Often, query and reporting tools are not particularly targeted at the data warehouse environment, but are instead generic tools that let you ask fairly structured questions. These tools are usually uncomplicated and are optimal for asking questions such as how many items you currently have in inventory, or how many customers have taken advantage of a sales promotion coupon. The output of these tools is usually in the form of a report. There are dozens of products in this category.
OLAP Tools. OLAP tools can take you a step beyond the power of query and reporting tools. Admittedly, the lines here are fairly fuzzy, but in general the distinguishing factor is that with an OLAP tool the data is represented using a multidimensional model rather than the more traditional tabular data model. The traditional model defines a database schema that focuses on modeling a process or a function, and the information is viewed as a set of transactions, each of which occurred at some single point of time in the past. In comparison, the multidimensional model usually defines a star schema (described earlier) and views data not as a single event, but rather as the cumulative effect of these events over some period of time, such as weeks, months, and/or years.
Also, with OLAP tools the user generally doesn't view the data in standard report formats, but rather in grids or crosstabs that can be pivoted to quickly look at the data from different perspectives. Additionally, OLAP involves interactive querying of the data. A user is able to follow a train of thought by being able to look at information at one aggregation level (such as a sales region), and then drill down into successively more detailed information, such as state, city, and store.
Be careful not to get confused here. OLAP tools simply give you an interesting way to look at your data. They do not imply how the data is actually stored. Given that, it's not surprising that there are multiple ways to store the data. One popular choice involves storing the data in a dedicated multidimensional database (MDD). Examples include Arbor Software's Essbase, Oracle Express Server, and Planning Science's Gentia. The other popular choice involves storing the data in relational databases and having an OLAP tool work directly against the data. This is referred to as relational OLAP, or ROLAP. Examples include MicroStrategy's DSS Server and related products, Informix's Informix-MetaCube, Information Advantage's AXSYS, and Platinum Technologies' Platinum InfoBeacon. (Some also include Red Brick's Warehouse in this category, but it isn't really an OLAP tool; it is a relational database optimized for performing the types of operations that ROLAP tools need.)
Both of these storage methods have numerous strengths and weaknesses. The weaknesses are being rapidly addressed by the respective vendors. However, there is one fact that must be taken into consideration. Currently, data warehouses are predominantly built using relational databases. In fact, the association is so strong that for many people, a data warehouse implies a relational database. If you have a warehouse built on a relational database and you want to perform OLAP analysis against it, then ROLAP is a natural fit. Yes, you'll forego many of the useful features that currently exist only in MDDs and you'll miss out on some of the business modeling capabilities, but you'll be able to work directly with your relational warehouse.
This isn't to say that MDDs can't be a part of your data warehouse solution. It's just that MDDs aren't currently well-suited for large volumes of data (10GB is fine, but 100GB is stretching their capabilities). If you really want the functionality benefits that come with MDD, I'd suggest subsetting the data into smaller MDD-based "data marts."
Automatic Data Mining Tools. As I mentioned earlier, the potential value of a data warehouse is realized only if you can extract information from it. Essentially, the promise of data warehousing is twofold: easier access to consistent data, and the ability to discover previously hidden information, patterns, and trends about your business. Query and reporting tools and OLAP tools address the first promise, but because of a subtle issue, do not do very well with the second promise. The subtle issue I'm referring to is the fact that when you use query and reporting or OLAP tools, the answers you get are only as good as the questions you ask. You will only find interesting patterns if you're looking for them.
Previously, the paradigm for getting information out of your data warehouse was that end users would start with hunches about something they wanted to find, and then they would use various tools to ask questions that would either verify or contradict their hunches. Data mining is fundamentally different - rather than requiring user intuition and user initiative, the initiative automatically comes from the data mining tool itself.
Automatic data mining is a sophisticated technology that combines multiple advanced data analysis techniques, including statistical analysis, decision trees, and neural nets. (See Bruce Moxon's article, "Defining Data Mining.") It looks through all of your data to find trends, patterns, and relationships that you may not have otherwise noticed. Data mining itself is not new. For many years data mining techniques have been used in the retail industry to help identify which items frequently sell together. This information helps retailers identify additional selling opportunities and can even help them determine where products should be placed on shelves. Credit card companies use data mining to detect purchasing patterns that indicate that a credit card has been stolen. The financial industry uses data mining to identify trends in stock prices or associations between the movements of the prices between two or more stocks. Banks are using data mining techniques to identify which factors are the most accurate indicators of whether a customer is a good candidate for a mortgage loan. Telecommunication companies use data mining to pinpoint customers that are most likely to be interested in new calling plans or new types of services. Insurance companies use data mining to detect fraudulent activities by flagging those activities that are outside the norm. The health care industry uses data mining to determine which combinations of treatments are the most successful for a particular health condition.
If data mining is not new, then why is interest in it surging? There are two reasons. First, as data warehouses get larger, users are being buried by the sheer volume of data. Trying to find all the buried gold by knowing which questions to ask becomes impossible. Automatically searching for the nuggets of knowledge lets companies increase the ROI of their data warehousing efforts. Second, the rapid acceptance of parallel and scalable machines is contributing to the data mining surge. Searching through and performing complex calculations on mountains of data requires large computational resources. Parallel platforms, with their ability to support large numbers of processors and vast amounts of I/O, provide the necessary resources.
Because this field is only now becoming popular, the market is still too small to determine the "dominant" tool suppliers. In addition, the technology is not yet mature. However, expect this market to heat up. The potential benefits from using an automatic data mining tool are just too compelling to ignore.
Data warehouses are also causing a surge in the popularity of data visualization techniques for looking at data. Data visualization is not actually a separate class of tools; rather, it is a method for displaying the data that is output by any of the previously mentioned tools. These visualization techniques are particularly valuable for viewing data that is queried from a data warehouse because the end users of a data warehouse are usually looking for trends or patterns, and the human mind is better at noticing visual trends and patterns than it is at noticing the same information in a standard report or crosstab.
At the most simple level, you could claim that the various line, bar, and pie charts with which we are already familiar constitute data visualization. In reality, however, data visualization goes far beyond simple charts. It is a complex technique that is currently an area of increased research as developers try to determine how to best display complex relationships on a two-dimensional computer monitor. For example, multidimensional analysis may look at sales of products by category by region by month. How can you best display four or five dimensions graphically? Current techniques are experimenting with using different colors and different shapes to signify different dimensional values. For example, a scatter plot may be enhanced to show additional dimensions not just by plotting simple dots on X and Y (and possibly Z) axes, but by changing the dots into objects of different shapes and colors. In addition, animation can be used to show how values change over the time dimension. As you might imagine, data visualization graphics have the potential to be incomprehensible (and therefore useless) if they are not executed well. However, when done properly, an enormous amount of interrelated data can be consolidated into an intuitive graphical display. As an aid to comprehension, the results can be astonishing.
Commercial data visualization tools are still in their infancy, but this category will probably grow rapidly over the next few years. One example of data visualization software is Silicon Graphics Inc.'s MineSet, which includes several visualization tools that display the results of data mining operations performed by other components of MineSet.
A section on accessing warehouse data would not be complete without a mention of how the World Wide Web is changing the landscape of data warehousing as much as it is changing the landscape of just about everything these days. At a high level, the goal of data warehousing and the goal of the Web are the same: easy access to data. The value of a data warehouse is maximized when the right data is in the hands of those who need it, regardless of when and where they need it. However, corporations have typically had to struggle with complex client/server architectures to give end users the access they need. Mix in the additional issues that arise when you try to give remote users access to the warehouse, and the result is often a truly complex environment. In addition, all users need training on how to use the client applications.
The Web removes these issues. While it doesn't necessarily give you more ways to look at your data, it does give you more access to your data. Last year, more Web servers were sold for Intranets than were sold for external Internet use. Corporations discovered that the Web provides an efficient mechanism to realize the benefits of client/server architectures. Many of the same applications will work over the Internet, removing the complexity previously associated with supporting remote access. Finally, the client application is the same Web browser that is used for every other Web application, meaning that millions of people already know how to use it. Vendors are quickly providing new Web development tools that allow Web browsers to access data warehouses, thereby addressing the previously discussed needs of the end user: access to the right data, regardless of when and where they need it. Arbor Software's Essbase Web Gateway, MicroStrategy's DSSWeb, and Information Advantage's WebOLAP are some examples of OLAP vendors that now provide access to OLAP servers from Web browsers.
When something as popular as data warehousing arrives on the scene, it's human nature to tweak the concept to see how it can be made even more useful. Though not everyone will agree, most data warehouses were initially conceived as a single consolidated database and as read-only. Two movements are challenging those aspects: data marts and operational data warehouses. What is most interesting is that these trends are essentially at odds with each other.
Data marts have many definitions, but the type of data mart I'm referring to is created as a subset of a larger data warehouse. There are many reasons for creating a data mart. One reason is performance. A marketing user might not want to have to use a machine that is always brought to its knees because the people in finance run complex queries against the warehouse. Another reason is simplicity. You limit the scope of what you're trying to achieve with a data mart, thereby making the solution simpler to understand and maintain. Finally, there is the autonomy issue. Data marts break up the potentially monolithic data warehouse, allowing each group of users to autonomously manipulate just the data they are most interested in, thereby decentralizing the data access.
Operational data warehouses move in the opposite direction. Rather than portioning off smaller pieces of the data warehouse, they combine the data warehouse with the traditional OLTP operational systems. These systems allow users to analyze the data and then take actions based on the analysis in realtime, all within the same system. For example, you can look at the purchasing patterns of recent sales and, based on that analysis, enter transactions to ship various products to your various stores. Or, you might analyze stock price trends and then issue buy and sell orders all within the same system. In addition, the resources expended on extracting data from the operational system can be reduced because the warehouse and the operational system are no longer two separate systems.
Essentially, data marts further replicate data and decentralize access, and operational data warehouses further consolidate data and centralize access. In the near future, data marts will flourish because people like to have decentralized control over their own data. (For example, people have always found ways to copy data out of a database and place it in a local spreadsheet so they can manipulate the data locally.) Automatic tools are being developed that will help keep the information in the data marts current and consistent with the data in the enterprise-level data warehouse. On the other hand, operational data warehouses will only play a niche role in the near future. The operational users are different from the analysis users, and it will probably stay that way for some time. In addition, even though this might sound like the "perfect world" solution, the challenges of resolving the performance implications of having critical OLTP applications share the same platform as resource-hungry DSS applications are daunting.
Data warehouses are clearly a viable solution to a real problem. With database software vendors scrambling to incorporate specific data warehouse features into their products, hardware vendors incorporating scalable technologies into their products, access tools developers adding a plethora of power features into their products, and all of them introducing quick-start programs, the data warehouse market will continue to grow rapidly. It may not be the vision of utopia described at the start of this article, but at least we can now get our jobs done a little more easily.