DBMS

Parallel Systems in 1995: The Year in Review

By Ken Rudin
DBMS, March 1996

What has been going on in the hot parallel database market? This update puts the tools in their places.


As I write this article, 1995 is coming to an end. And as a new year approaches, it's often instructive to look back over the last 12 months and recall the major events that occurred during that time. This is certainly true when you consider the commercial parallel processing market. By reviewing the major events and trends of the past year, and by looking at where the market is now, current and potential end users of parallel systems can better understand where the market will be going in the near future.

Unquestionably, 1995 was a busy year for the commercial parallel processing market. If I had to choose one phrase to summarize the year, it would be "increased mindshare." By this I mean that the term "parallel processing" has gone from relative obscurity only a few years ago to a commonplace phrase bandied about by IT managers. The term is associated not only with symmetric multiprocessing (SMP) machines, clustered machines, and massively parallel processor (MPP) machines, but also with the parallel databases that run on these platforms.

Parallel processing has finally reached the point at which few large-scale IT projects are undertaken without first evaluating whether parallel systems are really the right solution. However, it's important to realize that increased mindshare is only one step in the process of solidifying parallel systems' role in the corporate data center. For this increased mindshare to translate into increased marketshare, vendors and service providers will have to deliver on the promises and announcements they made in 1995.

Parallel Hardware Vendors Were Busy in 1995

Much of this increased mindshare is due to a flurry of activity by the hardware vendors. For example, IBM Corp. helped validate the MPP market by aggressively promoting its SP2 Scalable Powerparallel System. In a single year, IBM installed hundreds of SP2s into commercial data-processing sites. Roughly half of these machines are being used for true parallel processing applications (the rest are being used for LAN consolidation), earning IBM the title of having the most new parallel processing installations in 1995 of any hardware vendor.

The purchase of Pyramid Technology Corp. by Siemens Nixdorf Information Systems was another important event in the parallel systems market. Before Siemens, Pyramid had built a reputation for its Nile series platforms as robust SMP servers, and had installed its RM1000 MPP platform into a handful of highly visible accounts, where the machine proved to be a powerful solution. However, now that it has Siemens behind it, Pyramid has the resources and expanded selling channels it needs to promote its parallel solutions more aggressively to large data centers.

Compaq Computer Corp. became the newest entrant in the parallel systems market when it announced that it will deliver the capability to cluster its Pentium Pro PC systems for increased performance and improved fault tolerance. This announcement is important not only because it marks the entrance of one of the largest computer manufacturers into the parallel systems market, but also because the announced clustered system will have the potential to change the dynamics of the parallel system market radically. Previously, even the smallest parallel system would cost more than $100,000, and end users had to contact the vendor each time they wanted to scale up their system's processing capabilities. However, when Compaq delivers on its promise, end users will be able to go to their local computer store and purchase a small parallel system for an amount that will likely be within the credit limits of a corporate credit card. Plus, scaling up the system will be just as easy. Essentially, the Compaq solution has the potential to be the first mass-marketed parallel system.

Religious War #1: SMP vs. Cluster vs. MPP

The categorizations of SMP, cluster, and MPP are often a little blurry (and are getting blurrier), but they refer to architectural design choices made by the hardware vendor regarding how multiple processors will be incorporated into a single computer. In a typical SMP architecture, the machine has up to a few dozen processors, and each processor shares all hardware resources, including memory, disks, and the system bus. Because each processor can see all of the available memory, communicating between processors is straightforward -- one processor can easily view the results generated by another processor simply by looking at the appropriate section of memory.

Clusters usually refer to hardware architectures in which a few SMP machines are linked together by an interconnect. This allows you to harness the collective processing power of multiple SMP machines (referred to as "nodes"), rather than being limited to the processing power of a single SMP machine. However, in a cluster, each node has its own private memory, so communication between processors is more complex. A processor on one node cannot simply look at the results generated by the processors on another node. Instead, nodes communicate by explicitly sending messages and data across the interconnect.

MPP machines are conceptually similar to clusters in that they support multiple nodes that each have their own private memory, and that communicate via passing messages over an interconnect. One difference between MPPs and clusters is that the MPPs usually have uniprocessor nodes rather than SMP nodes. However, the major difference between MPPs and clusters is that the interconnect is much more sophisticated -- the bandwidth of the interconnect is often designed to increase as more nodes are added and more advanced connection schemes are used. The benefit of the advanced interconnect is that MPP platforms can handle up to hundreds of nodes, while it can be a challenge to coordinate that many nodes using the message-passing communication mechanism.
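
The difference between the two communication styles can be illustrated with a small Python sketch. It is only an analogy under stated assumptions, not a model of any vendor's hardware: threads stand in for processors, a queue stands in for the interconnect, and the names `smp_sum`, `Node`, and `cluster_sum` are invented for this illustration. In the shared-memory version, every worker writes its result into a list that all workers can see. In the message-passing version, each node keeps private data and must explicitly send its result to a coordinator.

```python
import queue
import threading


def smp_sum(chunks):
    """Shared-memory style: every worker writes into one list all can see."""
    results = [0] * len(chunks)          # memory visible to every "processor"

    def worker(i):
        results[i] = sum(chunks[i])      # any other worker could read this slot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(chunks))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


class Node:
    """Message-passing style: private memory plus an inbox on the interconnect."""

    def __init__(self, data):
        self._private = list(data)       # no other node reads this directly
        self.inbox = queue.Queue()       # stands in for the interconnect

    def compute_and_send(self, target):
        # The only way to share a result is to send it as an explicit message.
        target.inbox.put(sum(self._private))


def cluster_sum(chunks):
    coordinator = Node([])
    nodes = [Node(c) for c in chunks]
    threads = [threading.Thread(target=n.compute_and_send, args=(coordinator,))
               for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The coordinator learns each result only by receiving a message.
    return sum(coordinator.inbox.get() for _ in nodes)
```

Both functions return the same total for the same input; what differs is how the partial results travel. The extra send-and-receive step in `cluster_sum` is the kind of coordination work that grows with the number of nodes.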

With all these options to choose from, which platform is best: SMPs, clusters, or MPPs? In 1995, the majority of end users came to the conclusion that this is a religious war, because no single architecture is universally better than all of the others. In general, SMP is easier to manage and MPP is more scalable. Which platform is optimal completely depends on the specific application itself.

Over the past year, more real-world feedback was gathered informally regarding which types of applications are best suited to which platforms. There are large gray areas, but, in general, SMP systems seem to be best suited for either mission-critical or OLTP applications, where the application's growth rate is slow and steady at less than 20 percent annually and the amount of raw data is in the range of 10 to 100GB.

MPP systems are best suited for either complex analytical or very large decision-support applications, where the growth rate is unpredictable or more than 50 percent annually and the amount of raw data exceeds 200GB. Currently, there is not as much data on clustered systems, because clusters are primarily used as a way to add a second SMP system to an existing SMP system (in order to add fault-tolerance in case the first SMP fails).
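
These rules of thumb can be written down as a small Python function. This is a sketch that encodes only the rough thresholds quoted above; the function name `suggest_platform` and the workload labels are hypothetical, and anything outside the two stated profiles is reported as a gray area rather than guessed at.

```python
def suggest_platform(workload, annual_growth_pct, raw_data_gb):
    """Return "SMP", "MPP", or "gray area" using the rough thresholds above.

    workload: "oltp", "mission-critical", "analytical", or "decision-support"
              (labels invented for this sketch)
    annual_growth_pct: expected yearly growth, or None if unpredictable
    raw_data_gb: amount of raw data in gigabytes
    """
    smp_fit = (
        workload in ("oltp", "mission-critical")
        and annual_growth_pct is not None
        and annual_growth_pct < 20           # slow, steady growth
        and 10 <= raw_data_gb <= 100
    )
    mpp_fit = (
        workload in ("analytical", "decision-support")
        and (annual_growth_pct is None or annual_growth_pct > 50)
        and raw_data_gb > 200
    )
    if smp_fit:
        return "SMP"
    if mpp_fit:
        return "MPP"
    return "gray area"
```

For example, `suggest_platform("oltp", 10, 50)` returns `"SMP"`, `suggest_platform("decision-support", None, 500)` returns `"MPP"`, and an OLTP system with 150GB growing 30 percent a year falls into the gray area, where the choice depends on the specifics of the application.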

Enter the Hybrids

Not only has it become clear that no single architecture is best for every application, but a trend emerged in 1995 that combines the best of all the architectures into a single machine. For example, there are strong rumors about IBM's forthcoming SP3 product, which will be a massively parallel machine that uses small SMP machines as its individual processing nodes. In addition, Pyramid's RM1000 MPP platform was designed to use SMPs as its processing nodes and to include the fault-tolerance advantages of clustering.

Sequent Computer Systems Inc. also disclosed a different type of hybrid architecture when it announced its new MPP system, code named "Sting." As mentioned previously, in an MPP architecture each processing node has its own memory that is physically separate from the memory on every other node. This creates a more complicated programming environment, because, unlike an SMP machine where all information in memory is shared among all processors, programmers of MPP machines must explicitly send messages among processing nodes if they want to share information between these nodes. However, through a technique known as non-uniform memory access (NUMA) for its Sting computer, Sequent will create the illusion that each processing node's memory is actually just a piece of a larger, globally shared memory.

NUMA architectures are not new, but historically the overhead involved in creating the illusion was too high and the illusion fell apart. Sequent's challenge (which it claims to have solved) is to design a NUMA system in which the overhead is low enough that the illusion works. If it can do this, it will have the first system that has the tremendous scalability of MPP systems, yet is still able to maintain the simpler, shared-memory programming model of an SMP system.
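
The NUMA idea can be sketched in a few lines of Python. This is a toy model under loud assumptions, not a description of Sequent's design: the class name `NumaMemory` is invented, and the relative costs of 1 for a local access and 10 for a remote access are placeholders chosen only to show that the two differ. The point is that a single global address is quietly translated into a node and an offset, so the programmer sees one memory while the cost depends on where the data actually lives.

```python
class NumaMemory:
    """Toy global address space built from per-node memories."""

    def __init__(self, num_nodes, words_per_node, local_cost=1, remote_cost=10):
        self.words_per_node = words_per_node
        self.nodes = [[0] * words_per_node for _ in range(num_nodes)]
        self.local_cost = local_cost      # assumed relative cost, not measured
        self.remote_cost = remote_cost    # assumed relative cost, not measured
        self.total_cost = 0

    def _locate(self, address):
        # A global address maps to (owning node, offset within that node).
        return divmod(address, self.words_per_node)

    def _charge(self, requesting_node, home_node):
        if requesting_node == home_node:
            self.total_cost += self.local_cost
        else:
            self.total_cost += self.remote_cost

    def read(self, requesting_node, address):
        home, offset = self._locate(address)
        self._charge(requesting_node, home)
        return self.nodes[home][offset]

    def write(self, requesting_node, address, value):
        home, offset = self._locate(address)
        self._charge(requesting_node, home)
        self.nodes[home][offset] = value
```

A caller on node 3 can read an address that lives on node 0 with the same `read` call it would use for its own memory; only `total_cost` reveals the difference. If the remote cost is too high relative to the local cost, the overhead dominates and the shared-memory illusion stops being useful, which is the challenge described above.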

The lines defining the different parallel architectures are continuing to blur. When these new hybrids reach the market, the notion that SMP, clustered, and MPP architectures are distinct may disappear entirely. Once these distinct classifications are no longer useful, we will be left referring to all of these machines by the umbrella term of "parallel platforms."

Parallel Database Vendors Were Also Busy

Parallel database vendors also had a slew of new product offerings to help increase the market's awareness of parallel systems. In fact, each of the four largest RDBMS vendors released products that focused on scalability and parallelism. For example, for several years Oracle Corp. has had versions of its database that can run on all parallel hardware platforms. But, in 1995, Oracle announced the general availability of Oracle7.2, which focuses primarily on significantly improving the performance of parallel queries on SMP and MPP platforms. In turn, IBM announced the general availability of DB2/6000 Parallel Edition, the first version of DB2 to take advantage of the parallelism and scalability of MPP hardware platforms. (For a nuts-and-bolts discussion of IBM's DB2 Parallel Edition database server, see Stewart Miller's column, "Parallel Processing with DB2 PE.")

Informix Software Inc. wasn't idle either. In 1994, Informix released Informix-OnLine version 7.1, which used the core parallelism built into its Dynamic Scalable Architecture (DSA) to enable the database engine to take advantage of SMP platforms. In 1995, Informix extended the DSA architecture to take advantage of MPP platforms, and incorporated this functionality into a beta release of Informix-OnLine Extended Parallel Server (XPS) version 8.0.

Finally, Sybase Inc. made two recent announcements about the parallel systems market. First, its Navigation Server, which was released into production on the AT&T GIS 3600 platform in late 1994, has been ported to additional platforms (it is in beta on IBM's SP2 MPP platform, Hewlett-Packard's K400 cluster, and Sun Microsystems' Energizer cluster) and is now called Sybase MPP. Second and more important, Sybase announced System 11, which is designed to address the scalability problems System 10 encountered on SMP platforms. Emphasizing this design focus, one company officer quipped during the System 11 product announcement: "We had to ensure that System 11 would be scalable. If we told people that our new System 11 cures cancer, they would say, 'Great, but does it scale?'"

Computer Associates has no news in this arena. As for Microsoft, SQL Server 6 doesn't have parallel query capabilities (any parallel abilities at all, really), but rumors about parallelism in 1996 are coming out of Microsoft.

Religious War #2: Parallel Database Architectures

As market awareness of the advantages of parallel systems increased over the past year, each database vendor tried to ensure that its particular parallel database architecture was perceived as optimal. Each wanted to ensure that it got the largest piece of the rapidly growing mindshare. This led to intensified battles among the vendors regarding the virtues and pitfalls of the two main parallel database architectures used for clustered and MPP hardware: shared-disk and shared-nothing.

Briefly, in a shared-disk database architecture, each processing node can access the entire database. Even though certain disks in the system may be directly connected to certain nodes, the DBMS has no concept of a node "owning" a certain set of disks. Rather, all disks (and therefore all data) are accessible by all nodes.

In contrast, in a shared-nothing database architecture, each processing node is granted exclusive access to a portion of the database (known as a partition). Each node has its own set of local disks on which the node's database partition is stored, and no other node can directly access those disks (and therefore no other node can directly access another node's partition of the data). Oracle uses the shared-disk approach, and IBM, Informix, Sybase, Tandem, and Teradata (owned by AT&T GIS) use the shared-nothing approach.

Theoretically, the advantage of a shared-disk database architecture is increased flexibility, because any node can be selected to access any piece of data. However, this flexibility incurs additional overhead because the nodes must coordinate how they will share the data. In comparison, the theoretical advantage of a shared-nothing database architecture is increased scalability, because nodes don't have to coordinate data sharing, and therefore overall overhead is reduced. The drawback, however, is reduced flexibility, because the DBMS must access a piece of data by using the node that owns the data.
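
The trade-off can be made concrete with a minimal Python sketch. It does not reflect how any of the products named above are implemented: the class names `SharedNothingDB` and `SharedDiskDB` are invented, a CRC32 hash is an arbitrary stand-in for a partitioning function, and a single lock is a crude stand-in for whatever coordination mechanism a real shared-disk engine uses.

```python
import threading
import zlib
from collections import Counter


class SharedNothingDB:
    """Each node owns one partition; every request is routed to the owner."""

    def __init__(self, num_nodes):
        self.partitions = [{} for _ in range(num_nodes)]

    def owner(self, key):
        # Deterministic hash partitioning (crc32 is an arbitrary choice here).
        return zlib.crc32(key.encode()) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self.owner(key)][key] = value

    def get(self, key):
        # No choice of node: only the owning partition holds the key.
        return self.partitions[self.owner(key)].get(key)


class SharedDiskDB:
    """Any node may access any data, but nodes coordinate through a lock."""

    def __init__(self):
        self.storage = {}                    # all data visible to all nodes
        self.lock = threading.Lock()         # stands in for a lock manager
        self.lock_acquisitions = 0           # rough measure of coordination work
        self.accesses_by_node = Counter()

    def put(self, node, key, value):
        with self.lock:
            self.lock_acquisitions += 1
            self.accesses_by_node[node] += 1
            self.storage[key] = value

    def get(self, node, key):
        with self.lock:
            self.lock_acquisitions += 1
            self.accesses_by_node[node] += 1
            return self.storage.get(key)
```

In `SharedDiskDB`, the caller picks the node freely, which is the flexibility described above, but every access pays for coordination. In `SharedNothingDB`, there is no coordination, but `owner` dictates which node must do the work, so a popular partition cannot hand its load to an idle neighbor.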

Which architecture is best? Similar to the notion that no single hardware architecture is universally the best, neither of the parallel database architectures is always superior. First, each has strengths and weaknesses that manifest themselves differently, depending on the specific application. Second and more important, developers and DBAs have discovered over the past year that it's the quality of the database engine that really counts, not the software architecture. That is, did the database vendor do a good job of writing efficient code? Is the query optimizer intelligent and robust? Did the vendor use sophisticated caching algorithms? Can you tune all of the important parameters? How easy is it to determine which parameters need to be tuned? When evaluating a parallel database, these are the real issues to explore.

Enter the Hybrids, Again

The trend of combining architectures in the hardware world is also emerging in the parallel database software world, with Oracle being the first vendor to move toward combining shared-disk and shared-nothing capabilities. This past year, Oracle announced Oracle7.3, which begins to incorporate the notion of database partitions. When a shared-nothing approach is the optimal method for database requests, the partitions will be used in a shared-nothing fashion. However, when a shared-disk approach is optimal for database requests, the concept of partitions will be overridden and nodes will share data across partitions.

This is another trend that tends to blur distinctions and definitions. Don't be surprised if other database vendors create hybrid architectures over the next few years.

1995's Most Important Applications for Parallel Systems

Having looked at the events and trends regarding parallel systems vendors, it is important to define what these systems are currently being used for and where they are used effectively. Only about two years ago, the capabilities of parallel systems exceeded end users' needs. Parallel processing was not a common concept, so IT managers weren't exposed to the new types of applications that they could potentially build. They kept thinking of new applications using their old paradigms, so new applications were often fairly conservative. For example, a common application idea was to move an existing application to a parallel platform to gain improvements in performance. Generally, these applications were well within the physical capabilities of parallel platforms.

However, once people started experimenting with parallel systems, bold new ideas surfaced. Application designers began considering what strategic business benefits their companies could gain by performing complex processing on mountains of data. Within a period of less than two years, the market changed from a situation in which parallel systems' capabilities exceeded end users' needs to one in which end users' needs occasionally exceeded these systems' current capabilities, requiring certain compromises. Given this situation, it is easy to see the motivation behind all the new, more-advanced parallel products that were delivered or announced in 1995.

Data Warehouses, Data Mining, and Parallel Systems

One of the primary new applications for parallel systems is the data warehouse, which holds the promise of allowing corporate data centers to consolidate the mountains of data they collect every year, and then use this data to understand their customers, their markets, and their internal processes better. In 1995, as data warehouses became more popular, corporations turned toward scalable parallel computing technologies as the new foundation on which to build their warehouses.

To understand why, you must look at two prominent characteristics of data warehouses. First, they are very large systems, often containing hundreds of gigabytes of data. Second, data warehouses are never static. Additional data is added to these warehouses at a rapid rate, and as users begin to see the strategic value of these systems, new uses are also devised, which continually put additional demands on the processing and I/O capabilities of a data warehouse system. Therefore, these warehouses require underlying hardware and software platforms that can handle this large size and rapid growth.

Now let's look at two prominent characteristics of parallel systems. First, parallel systems are designed to handle very large IT problems by giving users access to enormous processing power, storage capacity, and I/O bandwidth. Second, they're designed to scale up easily in order to handle rapid growth. Users can buy a modest-size machine to begin with, and, as their usage requirements grow, the system's capabilities can be expanded to be 10 times or even 100 times more powerful. The fact that these capabilities closely match the requirements of a data warehouse is the driving force behind the rapidly increasing usage of parallel hardware and software as the platform for building data warehouses.

Last year, corporations also turned toward parallel systems as the optimal solution for data mining. The reasons are the same as those for data warehouses, but they are even more prominent because not only are data mining applications large, and not only do they grow rapidly, they also have very high computational requirements. Data mining applications sift through vast amounts of data looking for hidden trends and correlations. The tremendous raw processing capabilities provided by parallel hardware systems can help perform these complex trend analyses or correlation calculations.
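
As one illustration of why correlation calculations parallelize well, the Python sketch below computes a Pearson correlation coefficient over data that has been split into partitions. It is an assumption-laden example, not a description of any data mining product: the function names are invented, the input is assumed to be non-empty with some variation in both columns, and the thread pool only shows the structure of the computation (in standard CPython, threads would not actually speed up this arithmetic). Each partition produces a handful of partial sums independently, and a cheap merge step adds them together.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_sums(pairs):
    """Per-partition sums needed for Pearson correlation."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return n, sx, sy, sxx, syy, sxy


def parallel_correlation(partitions):
    """Compute Pearson's r over (x, y) pairs split across partitions."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(partial_sums, partitions))
    # Merge step: partial sums simply add up across partitions.
    n, sx, sy, sxx, syy, sxy = (sum(column) for column in zip(*parts))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator
```

Because the only data exchanged between partitions is six numbers each, the work scales with the number of nodes while the communication stays small, which is the property that makes this class of calculation a natural fit for parallel hardware.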

Applications on the Horizon for 1996

The huge surge in the popularity of the Internet and the World Wide Web in 1995 will have a big effect on the parallel systems market in the next few years. The promise of the Internet and the Web is easy access to data on a scale that was previously unimaginable. More and more data will be put online at a rapid rate, which will lead to increased demand for big back-end data servers that can handle both the large volumes of data and the rapid growth. In addition, the increasing use of voice, image, and video data types will add to this demand for large servers. Therefore, back-end Internet data servers will be the most prominent new applications for parallel hardware and software in the coming year.

What is Still Needed?

For the parallel systems market to continue its rapid growth, not only will the vendors need to deliver on their current promises, but they (or other third-party vendors) will also have to fulfill a growing market need for "parallel aware" tools that are built specifically for the parallel systems environment. For example, system configuration tools must improve. Currently, parallel systems users have very little guidance regarding how big their parallel systems need to be to address a particular application. In addition, data layout tools must be improved. Parallel systems often have hundreds of disks, and trying to distribute the data across these disks manually is not only tedious, it is also error prone. Application development tools and techniques will also require changes to help developers design parallel applications that are bottleneck-free and scalable, so that the applications can take advantage of the full parallelism and scalability of the underlying hardware.
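
To give a sense of what even a very simple data layout tool would automate, here is a hedged Python sketch. It assumes a table divided into equal-size extents and uses plain round-robin placement, which is only one of several possible schemes; the function names `round_robin_layout` and `layout_skew` are invented for this example.

```python
from collections import Counter


def round_robin_layout(num_extents, num_disks):
    """Assign extent i to disk i mod num_disks."""
    if num_disks <= 0:
        raise ValueError("num_disks must be positive")
    return [extent % num_disks for extent in range(num_extents)]


def layout_skew(layout, num_disks):
    """Difference between the fullest and emptiest disk, in extents."""
    counts = Counter(layout)
    per_disk = [counts.get(disk, 0) for disk in range(num_disks)]
    return max(per_disk) - min(per_disk)
```

Spreading 1,000 extents over 300 disks with `round_robin_layout` leaves every disk within one extent of every other, so `layout_skew` reports 1. A hand-built layout can be checked with the same function; a large skew points to disks that are likely to become I/O bottlenecks.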

Various vendors will also need to deliver "parallel-aware" system administration utilities. With database sizes creeping into the multi-terabyte range, parallel load, backup, and recovery tools are a must. Without these, an otherwise feasible application can become unmanageable.

Finally, tracking all the administrative tasks associated with maintaining multi-terabyte databases is often beyond the capabilities of the human brain. Therefore, tools to automate aspects of parallel system administration will be a requirement, not a luxury.

Converting Increased Mindshare into Increased Market Acceptance

In retrospect, the parallel systems marketplace was far more active in 1995 than in any previous year. Several more corporations explored parallel processing as a key component of their data-processing solution, major hardware and database software vendors made important announcements regarding their parallel solution strategies, and most major conferences included speakers discussing how they developed parallel solutions for their IT needs. Due to these factors, parallel processing is on everyone's mind. 1996 will be a critical year for determining whether the parallel systems vendors and service providers can deliver robust scalable solutions based on parallel technologies, thereby moving parallel systems from being on everyone's mind to being in everyone's data centers.


Ken Rudin is a managing director of Emergent Corp., a San Carlos, California-based consultancy that specializes in implementing IT solutions based on commercial parallel processing systems. You can reach Ken at 415-921-7000 or via the Internet at krudin@emergent.com.


* Compaq Computer Corp., 20555 State Hwy. 249, Houston, TX 77070-2698; 713-514-0484 or fax 713-514-4583.
* IBM Corp., One Old Orchard Rd., Armonk, NY 10504; 914-765-1900 or fax 914-765-4190.
* Informix Software Inc., 4100 Bohannon Dr., Menlo Park, CA 94025; 415-926-6300 or fax 415-506-7200.
* Oracle Corp., 500 Oracle Pkwy., Redwood Shores, CA 94065; 415-506-7000 or fax 415-506-7200.
* Pyramid Technology Corp. (a subsidiary of Siemens Nixdorf Information Systems Inc.), 3860 N. First Ave., San Jose, CA 95134-1702; 408-428-9000 or fax 408-428-8050.
* Sequent Computer Systems Inc., 15450 S.W. Koll Pkwy., Beaverton, OR 97006-6063; 503-626-5700 or fax 503-578-9890.
* Sybase Inc., 6475 Christie Ave., Emeryville, CA 94608; 510-922-3500 or fax 510-922-9441.



DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, November 6, 1996