DBMS/Emergent Corp. Transaction Processing Series

Scaling Operating Systems

How to pick, tune and scale an operating system to shoulder the weight of powerful OLTP applications.

By Curt Mayer
DBMS, March 1998

The choice of operating system can have a profound effect on online transaction processing (OLTP) environments. It will do you no good if a vendor-tuned synthetic benchmark runs very well on your chosen operating system but your application is untunable, unreliable, unmanageable, insecure, or unscalable because of weaknesses in the underlying operating system. Operating systems have special features that must be exploited to support OLTP applications. With the advent of Web-based commerce, inexpensive clusters, and open systems making inroads into the traditional mainframe arena of OLTP, scalable design is more important than ever. This article, a continuation of the Emergent Corp. Transaction Processing Series, covers those characteristics of an OLTP system that are most affected by the operating system. It also looks at networking requirements, including security, and compares two operating systems with a view to scalability. Finally, it presents techniques for tuning and scaling the system and network.

Requirements and Characteristics

OLTP can make significant demands on a computer system. The stresses differ from other types of processing in several important ways. First, an OLTP system usually has extraordinary availability requirements. Second, a large number of concurrent users are connected to the system, and the data-access patterns of OLTP users are difficult to predict. Furthermore, OLTP systems are often the focus of intruder attention. Finally, these systems are usually performance limited, so poor operating system policies can cause degradation that will be visible to users.

A mission-critical OLTP system must not go down unexpectedly, lose data because of component failure, or become inaccessible because of an unexpected network partition. Furthermore, planned downtime must be kept very brief. Some form of predictive maintenance must be used to detect problems before they become visible to the user community. An important part of predictive maintenance is having a capable set of monitoring tools.

Successful deployment of OLTP systems invariably causes a growth in the user community accessing the system. Each connected user consumes an amount of memory, called the user footprint, in the server. This user footprint can be very significant in size, upwards of a megabyte. Each user also uses a number of other data structures that could limit the number of users per server if a fixed number of these data structures are allocated by the system. Ultimately, growth in the sheer number of users is driving the requirement for scalable system design at all levels, because data volume, availability requirements, and network load all grow with the number of users.
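As a rough illustration of how the user footprint bounds the number of connections, the arithmetic can be sketched as follows. The function name and every figure in it are assumptions chosen for illustration, not measurements from any real system, and the sketch ignores the other per-user data structures mentioned above.

```python
def max_users(physical_mb, reserved_mb, footprint_mb):
    """Upper bound on concurrent users imposed by memory alone.

    physical_mb  -- installed memory (hypothetical figure)
    reserved_mb  -- memory set aside for the kernel, buffer cache, and so on
    footprint_mb -- per-user footprint in the server
    """
    if footprint_mb <= 0:
        raise ValueError("footprint_mb must be positive")
    available = physical_mb - reserved_mb
    return max(0, int(available // footprint_mb))


if __name__ == "__main__":
    # Assumed configuration: 2GB of memory, 512MB reserved, 1.5MB per user.
    print(max_users(2048, 512, 1.5))
```

Halving the footprint in this model doubles the ceiling, which is why per-user memory is worth measuring before buying hardware.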

OLTP workloads often show a random data access pattern, which is very different from the sequential data access pattern typical of decision-support workloads. The low probability of accessing cached data, as well as the inability to predict which data is needed next, requires a large buffer cache and a low-latency disk subsystem. As modern microprocessors are capable of addressing more than 4GB of real memory, scalability can be enhanced by the operating system allowing the buffer cache to be limited by the physical memory of the machine, rather than by an artificial limitation of the operating system.
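To see why random access favors a large buffer cache, consider a deliberately simplified model in which every block is equally likely to be requested. Under that assumption the hit ratio is just the fraction of the data that fits in the cache. The latency figures below are placeholders, not measurements.

```python
def hit_ratio(cache_gb, data_gb):
    """Probability that a uniformly random block is already cached."""
    return min(1.0, cache_gb / data_gb)


def effective_latency_ms(cache_gb, data_gb, mem_ms=0.001, disk_ms=10.0):
    """Average access time: hits cost mem_ms, misses cost disk_ms (assumed values)."""
    h = hit_ratio(cache_gb, data_gb)
    return h * mem_ms + (1.0 - h) * disk_ms


if __name__ == "__main__":
    for cache_gb in (1, 4, 16, 64):
        print(cache_gb, "GB cache:", round(effective_latency_ms(cache_gb, 100), 3), "ms")
```

Real workloads are usually skewed toward hot data, so this model is pessimistic, but it shows that average latency falls roughly in proportion to cache size until the whole data set is resident.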

Internet-based commercial systems are subject to many types of attack, including packet sniffing, password hacking, and denial of service. Even in intranet-only systems, you need high levels of security because the OLTP system usually contains the key mission-critical data. Data losses because of operator error are just as damaging as losses from intrusion.

Performance Limitations

A harsh reality of OLTP systems is that performance will be degraded during peak times. In these peak load periods, the system is run at saturation level, and any inefficiencies in the operating system are multiplied in impact. The two main metrics for OLTP performance are latency (the amount of time needed to complete a single operation) and throughput (the number of operations per unit of time). These two metrics often conflict, because optimizing a system for latency will decrease throughput, and vice-versa. Another performance limit is often reached when batch processes are triggered, such as reporting, backup, or data warehouse extraction, slowing down interactive users during these time windows. Engineering a throttle of some sort will limit the impact of these offline processes on the OLTP users.
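One possible form for such a throttle is a token bucket, sketched below. This is a common rate-limiting technique, not a mechanism described by any particular operating system or RDBMS. The class name and parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Token bucket: allows at most `rate` operations per second on average,
    with short bursts of up to `burst` operations."""

    def __init__(self, rate, burst, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            self.sleep((1 - self.tokens) / self.rate)


def run_batch(rows, process, throttle):
    """Process rows one at a time, pausing whenever the throttle is exhausted."""
    for row in rows:
        throttle.acquire()
        process(row)
```

A reporting or extraction job would call `run_batch` with a rate chosen from measurement, so that interactive users keep most of the disk and CPU during peak periods.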

Operating System Features for OLTP

Many enhancements, such as symmetric multiprocessing, threads, large memory models, low-latency networking, asynchronous I/O, network firewalls, and clusters, are becoming mainstream technologies because of OLTP deployment on open systems. A significant amount of convergence in features is apparent in available operating systems, although these vary in how each feature is implemented. Many of these operating system features have standard interfaces, allowing a measure of portability between different operating system vendors.

High-performance uniprocessors can give good performance with simple memory designs, but the introduction of large caches in the early '90s made available large amounts of main memory bandwidth for the inclusion of multiple microprocessors on the same memory technology. Operating systems lagged behind hardware, however, and for several years there was no effective use of the increased CPU horsepower. Exploiting multiple CPUs requires a thorough redesign of the operating system kernel, because much of an OLTP server's execution time is spent within the operating system code. Optional features found in mature SMP kernels that can add large amounts of performance include cache affinity scheduling, processor-specific memory allocation, lazy data structure synchronization, and dedicating processors to tasks. Symmetric multiprocessing is largely transparent to the user, and little or no tuning of this kernel functionality is possible.

All commercial RDBMSs use several concurrent execution contexts to implement the system's logic. Usually there will be more contexts in a runnable state than there are available processors, so the operating system selects which contexts should be run at any given instant. This function is known as a scheduler. Various scheduling parameters exist, including fixed and floating priority levels, starvation avoidance, favoring interactive contexts, processor affinity, and virtual memory working set size. A well-designed scheduler will be invisible, and a poorly designed one will cause users to complain about responsiveness, even when load is not at the highest levels. A scheduler usually lets you change the process priorities to favor tasks that an administrator would like to run more frequently.

All contexts within an RDBMS will frequently need to synchronize with other contexts. How the synchronization operations are implemented within the operating system can seriously affect OLTP response time. Application designers are often presented with a rich set of synchronization primitives, and choosing which is appropriate can have a profound impact on scalability. The simplest synchronization primitive, a spin lock, is perfect when contention is expected to be low, but it is disastrous when contention is high. Similarly, a synchronization event is incredibly expensive when used to protect a data structure that is atomically modified infrequently. Synchronization statistics-gathering tools are an important part of a scalability audit; these tools are completely specific to the operating system and, as such, are usually custom built.
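A custom-built statistics-gathering tool of the kind described can be quite small at the application level. The sketch below wraps a standard mutex and counts how often a caller had to wait. The class name is hypothetical, and a real audit would use whatever primitives and counters the operating system exposes.

```python
import threading


class CountingLock:
    """A mutex that records how often callers found it already held."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.contended = 0

    def __enter__(self):
        # Try without blocking first; failure means another context holds the lock.
        waited = not self._lock.acquire(blocking=False)
        if waited:
            self._lock.acquire()
        # The counters are only updated while the lock is held,
        # so they need no protection of their own.
        self.acquisitions += 1
        if waited:
            self.contended += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()

    def contention_ratio(self):
        """Fraction of acquisitions that had to wait."""
        return self.contended / self.acquisitions if self.acquisitions else 0.0
```

A ratio near zero suggests that a cheap primitive such as a spin lock would serve; a high ratio points to a data structure that needs a blocking primitive or a finer-grained locking design.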

Highly optimized asynchronous disk I/O is critical for high-performance OLTP systems; most operating system vendors have focused on this to the exclusion of other functions. However, scant attention has been paid to the effective management of large volumes of data. Datasets in the hundreds of gigabytes are becoming increasingly common, yet they are never cleanly supported by the operating system. RAID-aware file systems are rare, and most application environments have custom-built machinery for manipulating these large data sets. In addition, asynchronous I/O facilities are extremely nonstandardized, especially with respect to request cancellation, synchronization, and coexistence with synchronous I/O. Any application making extensive use of asynchronous I/O will require significant porting work to move to another vendor's operating system.

Security

Integrated firewall facilities are required for high-performance, secure OLTP. Add-on firewall products usually only protect a system when connections are established, and they then hand off connections to a trusting RDBMS after authentication. Add-on firewalls can fail badly when IP spoofing attacks are employed, and they are slow as well. Some operating system vendors are starting to support integrated encryption at the network layer, but it needs to be supported across the enterprise before it can be trusted to protect mission-critical OLTP systems. Malicious external users are not the only sources of security threats, however. A coherent security model will also protect your system against inadvertent access of data by data center personnel, runaway programs, and other internal threats. A firewall is merely the first level of defense, with periodic security audits, pervasive use of passwords at all levels, revoking of privileges when employees leave, and so on, all existing in an easy-to-use framework.

Another level of security that is often neglected is protection against failures of the cables, interfaces, switches, hubs, and routers making up the enterprise data network. These components can fail because of misconfiguration, severing, and denial-of-service attacks. A well-designed enterprise network will have redundant communications paths from end users to the data center. These redundant paths greatly simplify maintenance of the network as well. An operating system in this environment must be able to use the redundant communications links automatically when failures occur.

Clustering

Clustering is the act of melding several machines into a single operational system. This can be accomplished in many ways, with widely varying amounts of transparency. The most transparent clusters appear as a single system image and are generally marketed as massively parallel systems. These systems have a well-integrated communications fabric. Clustered systems are also built with standard network technologies, with little or no transparency. Clustering lets you update individual machines with new system or database software or perform hardware upgrades without removing access to the entire database. You can grow a clustered system in response to increased load by adding more machines to the cluster. Also, clustering allows fault tolerance by having client programs connect to surviving machines when a database server machine fails. Tools to manage clusters are currently a great weakness, and operating system vendors have not standardized these functions, so every cluster looks different to the data center staff. A well-designed cluster will also include transparent support for redundant network connections.
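Client-side failover, in which a client program connects to a surviving machine, can be sketched as a loop over the cluster members. The function name and the host names in the usage note are hypothetical. The `connect` parameter defaults to the standard library's `socket.create_connection` and is injectable so the logic can be exercised without a network.

```python
import socket


def connect_with_failover(servers, connect=socket.create_connection, timeout=2.0):
    """Return a connection to the first reachable (host, port) in `servers`.

    Raises ConnectionError if every server in the list fails.
    """
    last_error = None
    for host, port in servers:
        try:
            return connect((host, port), timeout)
        except OSError as exc:
            # This member is down or unreachable; try the next one.
            last_error = exc
    raise ConnectionError("no surviving server among %r" % (servers,)) from last_error
```

A client might call `connect_with_failover([("db1.example.com", 5000), ("db2.example.com", 5000)])`. A transparent cluster hides this loop from the application; with standard network technologies the application or its client library has to supply it.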

Data Redundancy

RAID technology is rapidly maturing, and it is widely deployed in OLTP environments. Inefficient RAID implementations can seriously degrade OLTP performance. RAID provides a varying amount of redundancy and performance, depending on how expensive a system is being built. RAID level 1, mirroring, uses at least two exact copies of the same disk data. If one fails, the other is used transparently. RAID level 5, interleaved parity, uses a mathematical property called parity to regenerate lost data. Parity requires less additional storage than mirroring, but it exacts a heavy toll in performance. If for cost reasons you require RAID 5, you should take great care in selecting the implementation. Naive implementations of RAID 5 may not protect against all common failure modes. Host software-based parity RAID is seldom a good idea for an OLTP system, because latencies will be much worse than with a hardware implementation, and OLTP workloads are often limited by disk latency.
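The parity property behind RAID 5 is the exclusive-OR of the data blocks in a stripe. The sketch below works on in-memory byte strings of equal length and is meant only to show the arithmetic; the function names are hypothetical, and a real implementation also has to handle striping, partial writes, and failures during a rebuild.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity_block):
    """Regenerate the single missing block from the survivors and the parity."""
    return parity(list(surviving_blocks) + [parity_block])


def updated_parity(old_parity, old_data, new_data):
    """Parity after one block is overwritten, computed from the old values."""
    return parity([old_parity, old_data, new_data])


if __name__ == "__main__":
    stripe = [b"OLTP", b"DBMS", b"RAID"]
    p = parity(stripe)
    # Pretend the disk holding the middle block has failed.
    print(rebuild([stripe[0], stripe[2]], p))
```

The `updated_parity` step hints at where the performance toll comes from: a small write typically has to read the old data and old parity before writing the new data and new parity, so one logical write becomes several physical I/Os.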

Many OLTP systems, such as Web commerce servers, have no obvious interval where the system can be taken offline for backup. This sort of environment requires an online backup system that does not cripple OLTP performance. Many operating system vendors have integrated third-party tools that meet this requirement into their products. Scriptable interfaces for tape robots, tape-volume management, backup verification, and automatic multitape capabilities are all essential features of backup support.

Operating systems for scalable OLTP must gracefully support the installation and manipulation of large numbers of disk drives. An administrator needs either a powerful GUI or a fully scriptable command set for creating volumes, adding disks to a volume, formatting disks, managing space on a large number of volumes, and other disk space-management tasks.

A performance-monitoring facility is the most important tool for system tuning, and it is critical from a scalability standpoint. Common tuning tasks include correlating statistics gathered at many different layers of software in order to find the cause for some performance anomaly. A high-resolution timestamp associated with each measurement simplifies this otherwise difficult task. A graphical interface is useful when a human has isolated an interval of interest, but sometimes you need statistical analysis of large logs of performance to find that interval. A powerful monitor will inevitably affect the system being monitored, so some care must be taken not to measure too many parameters too frequently.
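Correlating statistics from different layers by timestamp might look like the following sketch. It assumes each layer produces a non-empty list of (timestamp, value) pairs sorted by time and taken from a common clock; the function names, sample values, and tolerance are illustrative only.

```python
from bisect import bisect_left


def nearest(samples, ts):
    """Return the (timestamp, value) sample closest in time to ts.

    `samples` must be non-empty and sorted by timestamp.
    """
    times = [t for t, _ in samples]
    i = bisect_left(times, ts)
    candidates = samples[max(0, i - 1):i + 1]
    return min(candidates, key=lambda s: abs(s[0] - ts))


def correlate(os_samples, db_samples, tolerance):
    """Pair each OS-level sample with the nearest database-level sample,
    keeping only pairs whose timestamps differ by at most `tolerance`."""
    pairs = []
    for ts, os_value in os_samples:
        db_ts, db_value = nearest(db_samples, ts)
        if abs(db_ts - ts) <= tolerance:
            pairs.append((ts, os_value, db_value))
    return pairs


if __name__ == "__main__":
    disk_busy_pct = [(1.00, 90), (2.00, 20), (3.00, 95)]
    lock_waits = [(1.02, 400), (2.50, 50), (2.99, 420)]
    print(correlate(disk_busy_pct, lock_waits, tolerance=0.05))
```

The finer the timestamp resolution, the tighter the tolerance can be, and the less likely it is that unrelated events are paired by accident.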

Comparing Operating Systems

Let's compare two common operating system offerings to illustrate the scalability issues mentioned previously in this article. Then let's explore future directions for these and related systems.

Windows NT 4.0

Windows NT 4.0 is Microsoft's flagship enterprise operating system. (See the cover story "Running on NT," DBMS, December 1997.) It shows a serious attempt to build an industrial-grade environment for commercial applications. Although Windows NT 4.0 still has many rough edges, Microsoft's commitment and marketing muscle will ensure the ripening of this platform.

Windows NT has weak symmetric multiprocessor scalability because of architectural problems in the underlying kernel. Programming support for threads in Windows NT is excellent, with a wide variety of synchronization primitives of varying performance impact. The ease of building thread-based programs partially offsets the cumbersome overhead of the microkernel architecture.

Windows NT supports RAID, both in mirroring and parity configurations. However, the disk administration tool can only manipulate a relatively small number of volumes before it becomes unwieldy. Windows NT has limited clustering support. Clustering is currently supported for failover only, and a low-latency message transport programming interface is entirely absent.

Security with Windows NT is robust within the machine but weak on the network. A variety of incompatible network authentication schemes make it difficult to administer security across the entire network. Scripting support is very weak, making the automating of system administration tasks cumbersome. In addition, third-party tools addressing these weaknesses are not easily integrated with one another.

Windows NT has a superior performance monitoring tool that collects software and hardware statistics from many different layers. You can easily dump statistics into a spreadsheet for analysis, and you can display correlations in real time.

Solaris 2.6

Sun's fifth-generation Unix-based operating system is a mature, powerful operating environment embodying most features needed for OLTP systems. Solaris has unmatched symmetric multiprocessor support, scaling up to 64 processors. A monolithic kernel keeps system overhead down, and the kernel is well-designed to avoid unnecessary synchronization. Sun ships a software development suite specifically aimed at building multithreaded applications, but most programs use heavyweight processes instead of threads.

Solaris has excellent support for RAID in both hardware and software. Very large numbers of disks are cleanly supported via a scriptable administrative interface, and a graphical interface is also provided.

Scripting support is strong. Automation of administrative tasks is well-supported with batch scheduling, a capable system-logging facility, and a variety of powerful scripting languages.

Solaris's performance monitoring is rudimentary compared to Windows NT's performance monitor. Different software products produce different forms of statistics, hindering correlation. Third-party monitoring packages improve matters somewhat.

Solaris and other Unix systems have a strong lead in scalability over Windows NT. Microsoft is not standing still, however, and the company has developed some innovative technologies, notably the performance monitor, and will add state-of-the-art clustering support soon. Windows NT must coexist well with heterogeneous environments in order to penetrate into mission-critical OLTP environments, and this should happen within a few years.

Tuning a System

You can tune an installation of an OLTP application at many levels. These levels interact significantly, requiring that any tuning effort proceed in a top-down manner. A critical component of tuning is measurement, with statistics gathering at each level. After exhausting the tuning opportunities at the highest levels, the operating system-level statistics should be observed for unexpected measurements. Tuning requires a detailed understanding of the interactions among the various components of the system. This understanding can best be gained by experimentation within the following framework.

The first step in tuning is to define a baseline. This workload is characteristic of some interval of interest. A synthetic benchmark is always a poor choice if a real workload is possible. A baseline should be short, since many tuning iterations could be required, and the entire tuning process may need to be repeated if significant configuration changes happen. If possible, you should use an identical machine configuration for tuning, because operating system tuning often requires rebooting the machines, and operational systems often cannot tolerate downtime.

Measure the performance of the baseline using a wide variety of performance statistics. Note each performance number, and compare it against the theoretical maximum, if known. Any metric that is at this limit cannot be improved; these are candidates for detuning, which enables some low performance number to be increased at the high number's expense.

A tunable system has parameters that can be adjusted to cause a change in performance. Rarely does changing one parameter change just one measurement, and often changing one parameter causes an increase in one metric and a decrease in another. For example, increasing the size of the disk buffer cache will result in less disk I/O and more transactions, up to the point where paging starts, at which time disk I/O starts to rise and transactions drop. Some parameters are sensitive, and a small adjustment can cause a large change in performance. Some parameters show no change at all under the baseline load. Once a baseline has been established and the behavior of the various parameters has been elucidated, you can begin the iterative process of changing one single parameter at a time.
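The iterative process of changing one parameter at a time can be sketched as a simple loop. Everything here is an assumption made for illustration: `measure` stands in for running the baseline workload and returning a single score, and `toy_throughput` is an invented model that imitates the buffer cache example above, rising until paging starts and falling afterward.

```python
def tune(params, candidates, measure):
    """Vary one parameter at a time, keeping a change only if the score improves.

    params     -- starting configuration (dict)
    candidates -- {parameter name: list of values to try}
    measure    -- runs the baseline with a configuration and returns a score
    """
    best = dict(params)
    best_score = measure(best)
    for name, values in candidates.items():
        for value in values:
            trial = dict(best)
            trial[name] = value
            score = measure(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


def toy_throughput(p):
    """Invented model: throughput grows with cache size until paging at 1024MB."""
    cache = p["cache_mb"]
    tps = cache if cache <= 1024 else 1024 - 4 * (cache - 1024)
    return tps * (1.0 if p["io_threads"] >= 4 else 0.5)


if __name__ == "__main__":
    start = {"cache_mb": 256, "io_threads": 1}
    grid = {"cache_mb": [256, 512, 1024, 2048], "io_threads": [1, 4, 8]}
    print(tune(start, grid, toy_throughput))
```

Because parameters interact, a single pass like this can miss better combinations. In practice the pass is repeated, and each `measure` call is a full run of the baseline, which is one reason the baseline should be short.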

The goal of tuning is to maximize the number of performance metrics that approach the theoretical limit. A well-tuned system exhausts all system resources at the same time. There may be other constraints, such as maximum tolerable user wait time, that require some resources to be underused. A common case is that one disk drive is very busy and no other disks are busy at all; performance is low. Spreading the data found on that drive over many disks will cause the average disk activity to increase, and performance will rise, until some other limit is reached, that is.

Visualization tools can be valuable for finding correlations between parameter changes and performance. It is easy to see why measurement is critical with this kind of tuning. Measurement is not free, however, and you should prune the number and frequency of your measurements to as small as possible.

Scaling a System

System scaling is closely related to tuning; when a resource is at its limit, a scalable system allows you to add more of that resource. Similarly, measurement should be used on an ongoing basis to predict when scaling should occur. A scalable operating system should allow simple addition of the relevant resources. Adding disks to a volume should be easy, as should adding memory and CPUs. Adding network bandwidth may be easy or very difficult, depending on how much additional network infrastructure must be added.
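Using ongoing measurement to predict when scaling should occur can be as simple as fitting a trend line to a resource's usage. The sketch below uses an ordinary least-squares line, which assumes roughly linear growth; the function name and the sample figures are hypothetical.

```python
def days_until_limit(samples, limit):
    """Estimate days remaining before usage reaches `limit`.

    samples -- list of (day, usage) measurements, at least two distinct days,
               ordered by day
    Returns None when usage is flat or falling.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = samples[-1][0]
    # Day on which the fitted line crosses the limit, relative to the last sample.
    return max(0.0, (limit - intercept) / slope - last_day)


if __name__ == "__main__":
    # Assumed measurements: percent of disk capacity in use on days 0, 10, and 20.
    print(days_until_limit([(0, 50), (10, 60), (20, 70)], limit=100))
```

The estimate is only as good as the linear assumption, so it is best treated as an early warning that triggers a closer look rather than as a purchase date.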

A variant of scaling is increasing the number and type of functions the system performs. As new users are added, some of them may have different or new demands on the system. This can have profound effects on the workload, and it may require that you identify a new performance baseline. In particular, mixing OLTP workloads with a small number of long-running queries can seriously hamper performance.

Summing Up

Evolution in commercial operating systems has been largely driven by the demands of OLTP workloads. The details of building and maintaining a scalable system are operating system specific, although most of the techniques at an abstract level are not. Tuning a system is an iterative process using low-level understanding of a system's behavior and measurement to maximize performance. This low-level understanding, as well as the actual adjustment of system parameters, is very system-specific. Operating system selection is a trade-off situation, and knowing the strengths and weaknesses of your operating system is an important part of building, tuning, and scaling an OLTP system.


Curt Mayer is cofounder of Emergent Corp., a San Mateo, California-based consultancy that specializes in the design and delivery of highly scalable systems based on commercial parallel processing systems. You can email Curt at curt@emergent.com.



DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated January 29, 1998