DBMS, March 1996
Server Side By Stewart Miller

Parallel Processing with DB2 PE

Everything You Always Wanted To Know About DB2 Parallel Edition But Were Afraid To Ask.

DB2 Parallel Edition (PE) for AIX is IBM's software solution for faster processing of complex queries and millions of transactions. Remember those extremely data-intensive commercial and scientific problems that either took too long or were too expensive to run on other systems? IBM promises that DB2 PE's highly scalable architecture helps solve those critical problems.

Businesses increasingly need to analyze their growing volumes of transaction data for trends such as purchasing, inventory, and budgets. DB2 PE is built upon the DB2/6000 database server and now provides multinode parallelism, expanded performance, and g reater capacity. DB2 PE is able to provide the same response time with larger databases, simply by adding nodes. Assume that a decision-support application running on DB2/6000 with a 100MB database has a satisfactory response time. With DB2 PE, a two-nod e system would yield the same response time for a 200MB database, a four-node system would have the same response time for a 400MB database, and so on. With the scalability of DB2, you can keep a uniform response time as the size of the database increase s.

With DB2 PE, you can "mine" enormous data stores to discover useful information that was formerly lost in the proverbial shuffle. You can query entire databases, as opposed to scanning random samples. This unveils trends and can result in more accurate p redictions regarding shipments and sales volumes. Geographical patterns emerge, which enable IS executives to determine the prudence of moving products from one region to another for better sales. Plus, customer buying patterns may emerge that let retail ers examine transaction data on an individual customer basis.

Speed!

While benchmarks are usually considered biased, the information in this section is obtained from both IBM and customer results in three categories: standalone numbers (this basic metric includes capacity and load times), speedup (this metric measures que ry and utility performance as the number of nodes are increased in the system while maintaining the same database size), and scaleup (this metric measures the system performance as the database size, number of concurrent users, and number of nodes are sc aled proportionately).

Figure 1 shows the parallelism benefits of DB2 PE. In Table A, the number of rows is 1,000,000, the row size is 100, and the total size is 100MB. In Table B, the number of rows is 1,000,000, the row size is 1000, and the total size is 1GB. You can see from the tables that the scan enhanced performance by decreasing execution time.

When table scans are performed using more nodes, the execution times eventually flatten out. The reason is that the table partitions at each node become small enough so that the overhead of starting the scans at the different nodes offsets the performanc e gain from the parallel scan. In short, parallelism benefits for any operation are bounded by the sizes of the tables.

The performance benchmark on Figure 1's Table B is for a parallel scan operation that returns 10 percent (100,000 rows) of the data to the application, as shown in Figure 2. As the size of the system (the number of nodes) increases , the execution times improve, but the speedup is sublinear due to the serial bottleneck. In order to overcome this limitation, the application must be parallelized. Therefore, DB2 PE divides the application into multiple tasks, each running on a separat e coordinator. Divisions are based on either the range of data or where each task operates on a subset of the database nodes.

Figure 3 shows what happens when you perform an insert into: a temporary table of one row, one percent of Table A's total rows, and 10 percent of Table A's total rows. The linear execution time curves show near linear speedup gains as the system size increases. Linear speedup across different nodes occurs when you parallelize both the insert and subselect operations of this statement. This demonstrates that DB2 PE performs well for the parallel execution of insert, update, and del ete statements. In terms of index creation, performance benchmark curves show near linear performance improvement.

IBM performed concurrent execution scalability tests to measure the execution times of several queries submitted in a single stream by a single user. Next, IBM performed tests to measure queries that were concurrently submitted by several users where coo rdinator activity was distributed over all nodes in the systems. The results indicate that DB2 PE is able to scale superlinearly with respect to the concurrent users. DB2 PE is able to achieve this performance because it reuses the database buffers at ea ch node when it processes common concurrent requests. This will apply as long as the queries benefit from reusing pages that have been read from other queries. This concept is based on caching. If one query reads a table into memory and other queries sub sequently access that table, those queries will scale superlinearly. If queries have nothing in common, throughput will still increase until the system resources are exhausted. Short queries that do not gain significant performance improvements from para llelism include single-row updates, selects, and inserts.

While both speedup and scalability benchmarks are quite good for most queries and utilities, queries that require significant serial operations do not reflect linear speedup or scaleup. Examples include queries that return a significant number of rows to the application, or queries that involve coordinator sorts (which occur at the coordinator node before the data is returned to the application) or DISTINCT SQL operations. In addition, when the coordinator activity is high in proportion to the total act ivity, the system performance decreases. Parallelism only slightly improves performance for queries executed in extremely short times on a serial database, because serial database execution strategies are already efficient.

DB2 PE provides a database monitoring facility that lets DBAs collect data regarding resource consumption at the database manager, database, application, and individual processing levels. DBAs can determine where the bottlenecks are by analyzing the data collected at each node. Combining this information across all nodes provides a global picture regarding the true performance of the entire system.

Shared-Nothing Architecture

The strength behind DB2 PE is its scalable shared-nothing architecture. Each processor has its own independent operating system, memory, and disk. Significant performance improvements occur when the number of processors increases. In essence, doubling th e number of processors should result in a 50 percent decline in processing time.

DB2 PE stores the database across a network of processors. This approach prevents cache contention because it uses only one set of resources (which is shared by all processors). Each processor has its own log, so parallel recovery from failure in applica tions is not limited by the I/O bandwidth of a single log. No resources are shared among processors aside from the communication network, because each processor has its own memory, as well as local disks. Therefore, this architecture eliminates the memor y-access bottleneck that occurs in other, shared-something architectures.

Supporting the shared-nothing approach requires SQL requests to be broken into multiple subrequests, which are input to different nodes in the system. The output is then merged from these multiple nodes. Also, this environment requires distributed deadlo ck detection and a multiphase commit protocol.

Because resources are not shared in this environment, DB2 PE uses "function shipping," in which database operations are performed where the data resides, thereby minimizing network traffic. When dealing with task structure for simple queries, the query c ompiler determines the function to be performed by each task at runtime. The coordinator task is then instantiated on the node to which the application connects, and each slave task is instantiated on the nodes that are responsible for its specific data. Thus, function shipping minimizes communication among processors while relational operators are executed on the processor containing the applicable data.

Function shipping in the DB2/6000 parallel environment is not limited to queries; rather, it can also be used for database updates and running utilities. The hashing algorithm, used for table partitioning, determines the appropriate node that inserts a r ow into a table. During index maintenance, locking and logging processes are distributed across processors, index entries are updated, and row information is logged at the same node. DB2 PE instructs each processor to access only the portion of the datab ase that it owns locally; therefore, a processor does not have to request access permission from remote processors before accessing its local data store. This approach eliminates the need for a global lock table.

DB2 PE is built upon the Data Management Services layer of the nonparallel DB2/6000 engine, but IBM has completely reworked the communication services. The Data Protection Services (DPS) layer of DB2/6000 (responsible for locking, logging, and recovery) can activate more than one process and involve more than one node at one time. DPS extensions use a control message interface for global deadlock detection, two-phase commits, and recovery from system failures.

I/O shipping is an alternative to function shipping, but it increases data movement and decreases efficiency because queries are arbitrarily selected to run on one or more of the group's processors. For any given query, all of the data residing in the pr ocessors is sent to the executing processors. This increases data movement and network traffic significantly. (While DB2 PE does not implement I/O shipping, Oracle uses I/O shipping to move the data from the data nodes to the execution nodes.)

In addition, system configuration differs for I/O-shipping environments in that I/O nodes must be configured with a large number of I/O-specialized disks. There must be a moderate number of disks on each processor, and data must be partitioned over sever al processors. However, both the function-shipping and the I/O-shipping models require the same number of disks for very large databases (50GB or more).

Parallel Execution Strategy

SQL statements are executed using a cost-based relational database optimizer. Several execution strategies are compared for each SQL statement and summarily divided into several separate tasks. The coordinator task runs at the node where the application has connected. Slave tasks (subordinate tasks) accept most of the activity required by the query. These tasks cooperate with one another only when necessary. Slave tasks may appear multiple times, but coordinator tasks appear only once for each applicati on. Data manipulation queries do not have any new restrictions or requirements in DB2 PE, so user investments are protected with their existing applications.

In addition, DB2/6000 applications do not have to be recompiled to take advantage of parallel execution. Both new and existing applications that use data-manipulating SQL statements do not have to be changed upon migration to DB2 PE. DBAs only need to pe rform a rebind so that the optimizer can generate the best plans for existing SQL queries. SQL statements are optimized in a DB2/6000 parallel environment through the specific distribution of data across nodes, as well as through a cost analysis (via the cost-based query optimizer) of the functions associated with different operations.

Parallel execution strategies are generated by DB2 PE's compiler component and are implemented on the following principles:

  1. Cost-based optimization: The compiler generates different parallel execution plans while choosing the least-cost optimization plan. This optimizer is responsible for the parallelism of different operations.
  2. Data distribution: This optimizer comprehensively distributes data and partitions information about the base and intermediate tables in each query while choosing parallel execution strategies.
  3. Transparent parallelism: You do not have to alter applications using data manipulation statements in order for them to execute in DB2 PE. Moreover, you do not have to completely recompile application products for DB2/6000 in order to migrate t o DB2 PE, because the applications only require a rebind to the parallel database.
Query execution may require several logical tasks, and each task may be accomplished across multiple nodes. Coordinator task operators, such as distribute subsection (subtask), can control the runtime execution of slave tasks. DB2 PE requires interproces s communication operators. Query execution can be conceptualized as data flow on trees of operators separated by tasks, with sends and receives used for intertask communication. The query optimizer chooses the optimal join order, the best methods to acce ss base tables and compute each join, and the repartitioning strategy to determine the nodes on which operations need to be performed (the inner and outer tables may not be on the same set of nodes).

Finally, the query optimizer determines the cost of a plan by choosing between system resources and response time. In order to keep the query optimization problem manageable, DB2 PE tracks, on a per-node basis, the total system resources accumulated duri ng the bottom-up generation of a query. Response time is a measure of the maximum resources used across all of the nodes and network. A few of the subsets used to execute a join include: all the nodes, the nodes on which the inner table is partitioned, a nd the nodes on which the outer table is partitioned.

In addition, DB2/6000's query optimization uses a greedy heuristic to choose between different parallel join execution strategies. (This algorithm, when presented with a number of options, will choose the cheapest option available and then proceed to the next step, where it will again choose the cheapest option available, and so on.) For complicated queries, the coordinator returns the answer to the application and also binds any information required to compute the answer. DB2 PE also performs aggregati on functions (such as count) in two steps: the slave tasks compute local counts, and the coordinator sums the counts and outputs the answer to the application.

Parallel Runtime

Query plans or DDL statements are executed in parallel in the new edition, but in order to accomplish this IBM had to amplify the DB2/6000 runtime considerably. Interprocess communication is provided through the following mechanisms (these mechanisms app ly to all DDL and DML statements):
  1. control services, which handle interprocess control message flow
  2. table queue services, which handle the exchange of rows among agents across or within a node (these agents are responsible for proper execution of data flow operators connecting different slave tasks)
  3. communication manager, which routes the messages using the existing communications protocol
The coordinating agent, a unique and independent agent process, is used when dealing with requests to implement the Database Manager, create or drop a database, or monitor an application. The coordinating agent handles all of the requests from one applic ation, and is active only when the application is active.

The parallel agent divides each request into subrequests. It also coordinates the activities of other parallel agents working for the same application, and enables multiple agents to work for one application across several nodes or within one node. Indiv idual nodes have their own pool of agents from which parallel agents are drawn. The Database Manager, to save time on an application and database initialization, tries to have parallel agents process several requests from the same application.

The coordinator executes a query subtask (subsection), which in turn distributes the other subsections to be executed at the proper nodes. It then sends connection information (along with every request) for the table queues, as well as any host variable information that may be required.

DB2 PE is capable of choosing nodes at runtime, and makes its decision based on the query structure. Process creation can be an expensive operation. For long-running queries, the process creation is amortized over several million instructions, but consid erable overhead is incurred for shorter queries. To decrease this overhead, you must perform several optimizations, so that a pool of processes is instituted on each node. DB2 PE creates this pool as required. You can tune the size of the pool with certa in configuration parameters. Process-creation overhead for short queries is reduced by obtaining processes from the pool instead of creating them on the fly.

Table queues are interprocess data flow constructs; you can think of them as temporary tables, but they differ in that they don't have to be built completely (that is, the table never has to be materialized and placed on disk) before rows can be retrieve d from them. Table queues enable the SQL compiler and optimizer to generate the most efficient parallel plan, but each subsection can be executed on more than one node. Each of the sending processes can send each row to one or every receiver process, dep ending on the table queues' associated partitioning information.

The communication subsystem enables parallel communication by performing multiplexing and demultiplexing of messages among nodes. (Multiplexing is sending information from one node to many nodes, while demultiplexing is receiving information from many no des and compiling it onto a single node.) Its delivery mechanism supports the TCP/IP interface.

In serial databases, either the application or the database system handles only one request per application. DB2 PE enables the database system to be active along with the application. It can therefore process more than one query for the same appl ication.

Storage Strategies

DB2 PE provides DDL extensions to allow users to control the placement of database tables across the nodes in a parallel system. Selecting the best storage strategy for database tables is a difficult problem in parallel systems, but three approaches can help: declustering, assignment, and partitioning. Declustering distributes single-table rows across multiple nodes. A table is fully declustered when the rows are stored across all the nodes of the parallel database system. Partial declustering indicates that rows are distributed across a subset of nodes. The degree of declustering is equivalent to the number of table partitions, which are themselves a set of rows that are all stored at one node of a share-nothing system.

The problem of determining the specific set of nodes on which to store table partitions falls to assignment. Overlapped assignments indicate that two tables share one node, whereas a nonoverlapped assignment indicates that no common nodes are shared amon g tables. Fully overlapped assignment indicates that both tables share the exact same set of nodes. Assignment is restrained to full overlapping for full declustering environments, but partial declustering lets you assign table partitions.

DB2 PE provides several efficient ways to load large volumes of data into the database. Load utilities, used at each of the nodes containing a table partition for a given table, enable data to be loaded into a single table in parallel. An input file can be partitioned into multiple files and loaded in parallel (one per table partition) via a data partitioning utility, as well as via APIs.

Data reorganization becomes necessary when you cannot allocate disk space effectively. The physical layout of database tables can change due to insert, delete, and update activities. For instance, insertions sometimes result in overflow data blocks, and deletions may result in gaps. Therefore, table data may no longer be stored on contiguous disk pages. Insert and delete activity can also result in table partitions at some nodes that contain more data than others, creating a skewed data distribution. As additional data accumulates in the database over time, it is often necessary to increase the amount of table declustering in order to accommodate additional data.

Database files can be compacted and reclustered at each node via DB2 PE's Reorg utility. This utility executes in parallel across all nodes containing a partition for a table. A new file can be created without page gaps or overflow blocks. Keep in mind t hat the Reorg utility will require empty disk space -- the amount depends on the strategy used to cluster the table. If a sort is performed, the amount of required disk space will be between 2.5 and 3.0 times the size of the table; if a sort is not perfo rmed, it will be approximately 2.0 times the size of the table.

Some partitioning strategies cause skewed distribution, depending on the attribute values in a given relation. If data is analyzed at initial placement time, it is possible to determine the distribution of input attribute values for minimal data skew. Th erefore, DB2 PE provides analysis tools that analyze the data before it is input into the database.

A data redistribution utility can also help to minimize skew. Data redistribution operations determine which rows of data should be moved (based upon the new partitioning map) to obtain an even distribution of data across the nodes of a nodegroup. For no degroups containing several tables, distributing one table while ignoring the others results in loss of collocation. (Tables are considered collocated when they reside in the same nodegroup and have the same partitioning keys.) The redistribution operati on must be applied to all of the tables in the nodegroup, and each table must be redistributed in turn in order to preserve table collocation.

Data distribution across partitions and nodes can be determined through two new SQL scalar functions, PARTITION( ) and NODENUMBER( ), which return the partition number and node number to the row where the table is mapped. These new SQL functions are impl emented to achieve distribution of the rows in a parts table as shown in the following queries:

Query 1:
CREATE VIEW Partition_Nms(pnum) AS
SELECT PARTITION(PARTS)
FROM PARTS;
SELECT Pnum, COUNT(*)
FROM Partition_Nums
GROUP BY Pnum
ORDER BY Pnum;

Query 2:
CREATE VIEW Node_Nums(Nnum) AS
SELECT NODENUMBER(PARTS)
FROM PARTS;
SELECT Nnum, COUNT(*)
FROM Node_Nums
GROUP BY Nnum
ORDER BY Nnum;

Query 1 results in the number of rows containing the partition number, ranging from 0 to 4095, including the number of table rows that are mapped to the partition. Query 2 creates a view where each set of rows contains the node number and the number of r ows of the table that map to that node.

Backup and Restore for Independent Operation

The number of backup devices available determines the degree of parallelism you can have when backing up and restoring your database. DB2 PE is designed to allow each node in the system to be backed up independently. Simultaneous backup of data from seve ral nodes is possible when multiple backup devices are available. The backup utility creates a single backup image for the entire database partition residing on any given node.

When restoring, it is important to ensure that the restored database partition is harmonious with the other nodes in the system. Therefore, you must either restore all nodes in the system using uniform backup images, or restore the single node and rollfo rward logs to the specific time when the database state is compatible for all nodes.


FIGURE 1


--This figure shows the parallelism benefits of DB2 PE. In Table A, the number of rows is 1,000,000, the row size is 100, and the total size is 100MB. In Table B, the number of rows is 1,000,000, the row size is 1000, and the total size is 1GB.


FIGURE 2


--This graph shows execution times and speedup of a parallel scan that returns 10 percent of the rows in Table B.


FIGURE 3


--This figure shows what happens when you perform an insert into: a temporary table of one row, one percent of Table A's total rows, and 10 percent of Table A's total rows.


Stewart Miller is president and owner of Executive Information Services (Carlsbad, California), a company that independently contracts with organizations to advise, research, and write about the emerging mission-critical IT field. You can contact Stewart at 619-438-0291 or via email at EIS@sd.znet.com.
* IBM Corp., Old Orchard Rd., Armonk, NY 10504; 800-426-3333.
Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
March 1996 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, November 6, 1996