What has been going on in the hot parallel database market? This update puts the tools in their places.
As I write this article, 1995 is coming to an end. And as a new year approaches, it's often instructive to look back over the last 12 months and recall the major events that occurred during that time. This is certainly true when you consider the commercial parallel processing market. By reviewing the major events and trends of the past year, and by looking at where the market is now, current and potential end users of parallel systems can better understand where the market will be going in the near future.
Unquestionably, 1995 was a busy year for the commercial parallel processing market. If I had to choose one phrase to summarize the year, it would be "increased mindshare." By this I mean that the term "parallel processing" has risen from relative obscurity only a few years ago to become a commonplace phrase bandied about by IT managers. The term is associated not only with symmetric multiprocessing (SMP) machines, clustered machines, and massively parallel processor (MPP) machines, but also with the parallel databases that run on these platforms.
Parallel processing has finally reached the point at which few large-scale IT projects are undertaken without first evaluating whether parallel systems are really the right solution. However, it's important to realize that increased mindshare is only one step in the process of solidifying parallel systems' role in the corporate data center. For this increased mindshare to translate into increased marketshare, vendors and service providers will have to deliver on the promises and announcements they made in 1995.
The purchase of Pyramid Technology Corp. by Siemens Nixdorf Information Systems was another important event in the parallel systems market. Before Siemens, Pyramid had built a reputation for its Nile series platforms as robust SMP servers, and had installed its RM1000 MPP platform into a handful of highly visible accounts, where the machine proved to be a powerful solution. However, now that it has Siemens behind it, Pyramid has the resources and expanded selling channels it needs to promote its parallel solutions more aggressively to large data centers.
Compaq Computer Corp. became the newest entrant in the parallel systems market when it announced that it will deliver the capability to cluster its Pentium Pro PC systems for increased performance and improved fault tolerance. This announcement is important not only because it marks the entrance of one of the largest computer manufacturers into the parallel systems market, but also because the announced clustered system will have the potential to change the dynamics of the parallel system market radically. Previously, even the smallest parallel system would cost more than $100,000, and end users had to contact the vendor each time they wanted to scale up their system's processing capabilities. However, when Compaq delivers on its promise, end users will be able to go to their local computer store and purchase a small parallel system for an amount that will likely be within the credit limits of a corporate credit card. Plus, scaling up the system will be just as easy. Essentially, the Compaq solution has the potential to be the first mass-marketed parallel system.
Clusters usually refer to hardware architectures in which a few SMP machines are linked together by an interconnect. This allows you to harness the collective processing power of multiple SMP machines (referred to as "nodes"), rather than being limited to the processing power of a single SMP machine. However, in a cluster, each node has its own private memory, so communication between processors is more complex. A processor on one node cannot simply look at the results generated by the processors on another node. Instead, nodes communicate by explicitly sending messages and data across the interconnect.
MPP machines are conceptually similar to clusters in that they support multiple nodes that each have their own private memory, and that communicate via passing messages over an interconnect. One difference between MPPs and clusters is that the MPPs usually have uniprocessor nodes rather than SMP nodes. However, the major difference between MPPs and clusters is that the interconnect is much more sophisticated -- the bandwidth of the interconnect is often designed to increase as more nodes are added and more advanced connection schemes are used. The benefit of the advanced interconnect is that MPP platforms can handle up to hundreds of nodes, while it can be a challenge to coordinate that many nodes using the message-passing communication mechanism.
With all these options to choose from, which platform is best: SMPs, clusters, or MPPs? In 1995, the majority of end users came to the conclusion that this is a religious war, because no single architecture is universally better than all of the others. In general, SMP is easier to manage and MPP is more scalable. Which platform is optimal completely depends on the specific application itself.
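The private-memory, explicit-message model described above can be illustrated with a small simulation. This is only a sketch: the `Node` class, its `inbox` queue, and the `partial_sum` key are hypothetical names invented for illustration, and an in-process queue stands in for a real interconnect.

```python
from collections import deque


class Node:
    """One cluster node: private memory plus an inbox fed by the interconnect."""

    def __init__(self, name):
        self.name = name
        self.memory = {}      # private: other nodes never read this directly
        self.inbox = deque()  # stand-in for messages arriving over the interconnect

    def send(self, other, key):
        # Explicitly ship a copy of one local result to another node.
        other.inbox.append((self.name, key, self.memory[key]))

    def receive_all(self):
        # Drain the inbox, storing each received value in local memory.
        while self.inbox:
            sender, key, value = self.inbox.popleft()
            self.memory[f"{sender}.{key}"] = value


node_a = Node("a")
node_b = Node("b")

# Each node computes a partial result in its own private memory.
node_a.memory["partial_sum"] = sum(range(0, 50))
node_b.memory["partial_sum"] = sum(range(50, 100))

# Node a cannot look inside node b; node b must send its result across.
node_b.send(node_a, "partial_sum")
node_a.receive_all()
total = node_a.memory["partial_sum"] + node_a.memory["b.partial_sum"]
```

Only after the explicit `send` and `receive_all` steps does node a hold both partial sums; on an SMP machine, by contrast, both values would already sit in the same shared memory.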
Over the past year, more real-world feedback was gathered informally regarding which types of applications are best suited to which platforms. There are large gray areas, but, in general, SMP systems seem to be best suited for either mission-critical or OLTP applications, where the application's growth rate is slow and steady at less than 20 percent annually and the amount of raw data is in the range of 10 to 100GB.
MPP systems are best suited for either complex analytical or very large decision-support applications, where the growth rate is unpredictable or more than 50 percent annually and the amount of raw data exceeds 200GB. Currently, there is not as much data on clustered systems, because clusters are primarily used as a way to add a second SMP system to an existing SMP system (in order to add fault-tolerance in case the first SMP fails).
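These rules of thumb can be written down as a small decision function. It is a rough sketch of the thresholds quoted above, not a sizing tool: the function name `suggest_platform` and its parameters are hypothetical, and it deliberately ignores the workload type (OLTP versus decision support), which the guidelines also weigh.

```python
def suggest_platform(raw_data_gb, annual_growth_pct=None):
    """Apply the article's informal thresholds.

    annual_growth_pct=None is used here to mean "growth is unpredictable".
    """
    unpredictable = annual_growth_pct is None

    # MPP: more than 200GB of raw data, with unpredictable or >50% annual growth.
    if raw_data_gb > 200 and (unpredictable or annual_growth_pct > 50):
        return "MPP"

    # SMP: 10 to 100GB of raw data, with steady growth under 20% annually.
    if 10 <= raw_data_gb <= 100 and not unpredictable and annual_growth_pct < 20:
        return "SMP"

    # Everything else falls into the large gray areas the feedback left open.
    return "gray area"
```

For example, `suggest_platform(50, 10)` falls in the SMP range and `suggest_platform(500)` in the MPP range, while a 150GB application growing 30 percent a year lands in the gray area.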
Sequent Computer Systems Inc. also disclosed a different type of hybrid architecture when it announced its new MPP system, code named "Sting." As mentioned previously, in an MPP architecture each processing node has its own memory that is physically separate from the memory on every other node. This creates a more complicated programming environment, because, unlike an SMP machine where all information in memory is shared among all processors, programmers of MPP machines must explicitly send messages among processing nodes if they want to share information between these nodes. However, through a technique known as non-uniform memory access (NUMA) for its Sting computer, Sequent will create the illusion that each processing node's memory is actually just a piece of a larger, globally shared memory.
NUMA architectures are not new, but historically the overhead involved in creating the illusion was too high and the illusion fell apart. Sequent's challenge (which it claims to have solved) is to design a NUMA system in which the overhead is low enough that the illusion works. If it can do this, it will have the first system that has the tremendous scalability of MPP systems, yet is still able to maintain the simpler, shared-memory programming model of an SMP system.
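The idea of a globally shared memory assembled from per-node memories, with a higher price for remote accesses, can be sketched as a toy model. Nothing here reflects Sequent's actual design: the `NumaMachine` class, the address layout, and the cost figures (1 unit for a local access, 10 for a remote one) are all assumptions chosen for illustration.

```python
class NumaMachine:
    """Toy model: one global address space laid over per-node memories."""

    def __init__(self, node_count, words_per_node, local_cost=1, remote_cost=10):
        self.words_per_node = words_per_node
        self.local_cost = local_cost      # hypothetical cost units
        self.remote_cost = remote_cost    # hypothetical cost units
        self.node_memory = [[0] * words_per_node for _ in range(node_count)]
        self.total_cost = 0

    def _locate(self, address):
        # Map a global address to (home node, offset within that node).
        return divmod(address, self.words_per_node)

    def _charge(self, from_node, home):
        # Non-uniform access: remote memory costs more than local memory.
        self.total_cost += self.local_cost if from_node == home else self.remote_cost

    def write(self, from_node, address, value):
        home, offset = self._locate(address)
        self.node_memory[home][offset] = value
        self._charge(from_node, home)

    def read(self, from_node, address):
        home, offset = self._locate(address)
        self._charge(from_node, home)
        return self.node_memory[home][offset]


machine = NumaMachine(node_count=4, words_per_node=100)
machine.write(from_node=0, address=5, value=42)  # local: address 5 lives on node 0
value = machine.read(from_node=3, address=5)     # remote: node 3 reads node 0's memory
```

The caller uses plain addresses and never sends a message, which is the shared-memory illusion. The `total_cost` counter shows where the overhead hides: the illusion holds only if the remote cost stays low enough relative to the local one.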
The lines defining the different parallel architectures are continuing to blur. When these new hybrids reach the market, the notion that SMP, clustered, and MPP architectures are distinct may disappear entirely. Once these distinct classifications are no longer useful, we will be left referring to all of these machines by the umbrella term of "parallel platforms."
Informix Software Inc. wasn't idle either. In 1994, Informix released Informix-OnLine version 7.1, which used the core parallelism built into its Dynamic Scalable Architecture (DSA) to enable the database engine to take advantage of SMP platforms. In 1995, Informix extended the DSA architecture to take advantage of MPP platforms, and incorporated this functionality into a beta release of Informix-OnLine Extended Parallel Server (XPS) version 8.0.
Finally, Sybase Inc. made two recent announcements about the parallel systems market. First, its Navigation Server, which was released into production on the AT&T GIS 3600 platform in late 1994, has been ported to additional platforms (it is in beta on IBM's SP2 MPP platform, Hewlett-Packard's K400 cluster, and Sun Microsystems' Energizer cluster) and is now called Sybase MPP. Second and more important, Sybase announced System 11, which is designed to address the scalability problems System 10 encountered on SMP platforms. Emphasizing this design focus, one company officer quipped during the System 11 product announcement: "We had to ensure that System 11 would be scalable. If we told people that our new System 11 cures cancer, they would say, 'Great, but does it scale?'"
Computer Associates has no news in this arena. As for Microsoft, SQL Server 6 doesn't have parallel query capabilities (any parallel abilities at all, really), but rumors about parallelism in 1996 are coming out of Microsoft.
Briefly, in a shared-disk database architecture, each processing node can access the entire database. Even though certain disks in the system may be directly connected to certain nodes, the DBMS has no concept of a node "owning" a certain set of disks. Rather, all disks (and therefore all data) are accessible by all nodes.
In contrast, in a shared-nothing database architecture, each processing node is granted exclusive access to a portion of the database (known as a partition). Each node has its own set of local disks on which the node's database partition is stored, and no other node can directly access those disks (and therefore no other node can directly access another node's partition of the data). Oracle uses the shared-disk approach, and IBM, Informix, Sybase, Tandem, and Teradata (owned by AT&T GIS) use the shared-nothing approach.
Theoretically, the advantage of a shared-disk database architecture is increased flexibility, because any node can be selected to access any piece of data. However, this flexibility incurs additional overhead because the nodes must coordinate how they will share the data. In comparison, the theoretical advantage of a shared-nothing database architecture is increased scalability, because nodes don't have to coordinate data sharing, and therefore overall overhead is reduced. The drawback, however, is reduced flexibility, because the DBMS must access a piece of data by using the node that owns the data.
Which architecture is best? Similar to the notion that no single hardware architecture is universally the best, neither of the parallel database architectures is always superior. First, each has strengths and weaknesses that manifest themselves differently, depending on the specific application. Second and more important, developers and DBAs have discovered over the past year that it's the quality of the database engine that really counts, not the software architecture. That is, did the database vendor do a good job of writing efficient code? Is the query optimizer intelligent and robust? Did the vendor use sophisticated caching algorithms? Can you tune all of the important parameters? How easy is it to determine which parameters need to be tuned? When evaluating a parallel database, these are the real issues to explore.
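The trade-off between the two architectures can be made concrete with a minimal sketch. It does not represent how any vendor's engine works: the class names are hypothetical, hash partitioning is just one assumed way to assign data to owning nodes, and the `locks_taken` counter is a crude stand-in for the coordination overhead of shared-disk access.

```python
import zlib

NODE_COUNT = 4


def owner_of(key, node_count=NODE_COUNT):
    # Assumed hash partitioning; crc32 is stable across runs, unlike hash().
    return zlib.crc32(key.encode("utf-8")) % node_count


class SharedNothingDB:
    """Each node owns one partition; a request must go to the owning node."""

    def __init__(self, node_count=NODE_COUNT):
        self.partitions = [{} for _ in range(node_count)]

    def put(self, key, value):
        self.partitions[owner_of(key, len(self.partitions))][key] = value

    def get(self, key):
        node = owner_of(key, len(self.partitions))  # no choice of node
        return node, self.partitions[node][key]


class SharedDiskDB:
    """Any node can serve any key, but every access must be coordinated."""

    def __init__(self, node_count=NODE_COUNT):
        self.node_count = node_count
        self.disk = {}        # one pool of data visible to all nodes
        self.locks_taken = 0  # stand-in for inter-node coordination overhead

    def put(self, via_node, key, value):
        self.locks_taken += 1
        self.disk[key] = value

    def get(self, via_node, key):
        self.locks_taken += 1
        return via_node, self.disk[key]


shared_nothing = SharedNothingDB()
shared_nothing.put("cust-1", "Acme")
sn_node, sn_value = shared_nothing.get("cust-1")

shared_disk = SharedDiskDB()
shared_disk.put(0, "cust-1", "Acme")
sd_node, sd_value = shared_disk.get(3, "cust-1")  # any node may be chosen
```

In the shared-nothing case the node is dictated by `owner_of`, so there is nothing to coordinate but also no freedom to pick a less busy node. In the shared-disk case the caller picks the node freely, and each access pays the coordination cost.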
This is another trend that tends to blur distinctions and definitions. Don't be surprised if other database vendors create hybrid architectures over the next few years.
However, once people started experimenting with parallel systems, bold new ideas surfaced. Application designers began considering what strategic business benefits their companies could gain by performing complex processing on mountains of data. Within a period of less than two years, the market changed from a situation in which parallel systems' capabilities exceeded end users' needs to one in which end users' needs occasionally exceeded these systems' current capabilities, requiring certain compromises. Given this situation, it is easy to see the motivation behind all the new, more-advanced parallel products that were delivered or announced in 1995.
To understand why, you must look at two prominent characteristics of data warehouses. First, they are very large systems, often containing hundreds of gigabytes of data. Second, data warehouses are never static. Additional data is added to these warehouses at a rapid rate, and as users begin to see the strategic value of these systems, new uses are also devised, which continually put additional demands on the processing and I/O capabilities of a data warehouse system. Therefore, these warehouses require underlying hardware and software platforms that can handle this large size and rapid growth.
Now let's look at two prominent characteristics of parallel systems. First, parallel systems are designed to handle very large IT problems by giving users access to enormous processing power, storage capacity, and I/O bandwidth. Second, they're designed to scale up easily in order to handle rapid growth. Users can buy a modest-size machine to begin with, and, as their usage requirements grow, the system's capabilities can be expanded to be 10 times or even 100 times more powerful. The fact that these capabilities closely match the requirements of a data warehouse is the driving force behind the rapidly increasing usage of parallel hardware and software as the platform for building data warehouses.
Last year, corporations also turned toward parallel systems as the optimal solution for data mining. The reasons are the same as those for data warehouses, but they are even more prominent because not only are data mining applications large, and not only do they grow rapidly, they also have very high computational requirements. Data mining applications sift through vast amounts of data looking for hidden trends and correlations. The tremendous raw processing capabilities provided by parallel hardware systems can help perform these complex trend analyses or correlation calculations.
Various vendors will also need to deliver "parallel-aware" system administration utilities. With database sizes creeping into the multi-terabyte range, parallel load, backup, and recovery tools are a must. Without these, an otherwise feasible application can become unmanageable.
Finally, tracking all the administrative tasks associated with maintaining multi-terabyte databases is often beyond the capabilities of the human brain. Therefore, tools to automate aspects of parallel system administration will be a requirement, not a luxury.