DBMS

Server Automation Principles

By Howard Fosdick
DBMS, February 1996

Practical approaches to automating database server administration and maintenance can unshackle the tired DBA.


As IS organizations evolve toward client/server, managing distributed database servers becomes increasingly important. IS staffs that once managed a single, centralized data center are now responsible for several database servers scattered throughout the company. Some of these servers may have been brought in without IS's involvement, or they may not conform to organizational standards. Whatever the case, IS must support these servers.

Several factors make the administration of distributed database servers difficult. First, the administrative tools for this market are still immature. While several of these types of tools exist, they are often "point" solutions rather than comprehensive solutions. Second, many of the available tools are not well integrated. For example, network administration is often distinct from remote operating systems administration, which is distinct from database administration. This lack of integration makes hands-on administration with the tools that much more time-consuming. Plus, the operating systems' platforms themselves may lack the connectivity necessary for remote administration. For example, Unix is well-known for its built-in remote login and file-transfer capabilities (via the "telnet" and "ftp" commands), but some server platforms still lack such features. Finally, IS staffs are often ill-prepared for their new roles as caretakers of potentially dozens or even hundreds of dispersed database servers.

What's to be done? One solution is simply to slow the pace of server distribution until the technology for administration catches up. However, organizations are loathe to delay. Client/server technology is driven by business needs and user departments, and if it does not address these needs, IS fails the business.

Another approach is for IS to automate remote database and server administration itself. Although this may at first sound like a daunting task, it may be much more practical than it appears. IS-written code can ride on top of and supplement whatever administrative tools the company buys. The quality and breadth of these underlying purchased tools are rapidly increasing. Furthermore, IS best knows its own distributed topology and local needs -- even when IS purchases administrative tools, it must still customize them to its needs.

This article explains how to automate database servers, including administration and maintenance, using technology that is available today. The systems I describe in this article are real systems, drawn from my own experiences. Any product-mentions are not endorsements; rather, they provide specifics about the projects I discuss.

Concepts

Before describing specific solutions, let's define some underlying concepts. One key function of automation is to respond to occurrences within the system. Examples of such events might be when a file reaches 90 percent capacity, or when a program has successfully completed execution. Events that refer to some relative condition (a file is x percent full, a CPU is x percent busy) are often called "thresholds." An "event alerter" is a mechanism that identifies when an event has occurred or a threshold has been reached, and notifies a person or program in a predefined manner. For example, an event alerter might notify an operator that a file system has become corrupted by displaying an emergency message on the operator's system console.

"Agents" are programs that reside on specific systems and feed information back to some central collector. For example, a database agent might reside on each of five database servers. At periodic intervals, when specific events occur or when certain thresholds are reached, these agents might report back to a central monitor. The monitor acts as a coordinator or control program for the distributed database agents.

Agents may operate passively, merely in the roles of distributed data collectors or event monitors, or they may be active. Active agents change or alter the environment in which they run. For example, a passive agent might notify an operator or monitor program that a file system is full; an active agent would allocate more disk space and correct the file space condition.

Fundamental Principles

Five key principles underlie server automation. (See Table 1.) First, programs should fix problems. The traditional approach is to notify operators or technical support personnel of a problem that needs fixing. Technology renders this mindset obsolete. Automate as much as possible.

While this principle sounds obvious, almost every IS site manually monitors or intervenes to correct conditions that could easily be automated. For example, suppose a DBA schedules a reorganization utility to run on a particular database overnight. Why doesn't the system have the intelligence to auto-schedule this activity? Why is human intervention necessary to accomplish such predictable maintenance tasks?

As another example, suppose the reorganization utility fails due to lack of workspace. This is a common cause of interrupted sleep for DBAs across the country. Why can't the reorg process itself identify and rectify such a predictable, simple problem? An "intelligent reorg" would dynamically acquire more space as needed to ensure that the reorg completes. Surely, the reorg should be intelligent enough to preallocate the workspace necessary before running (in order to avoid bombing later). A judgment call must be made in order to ensure that a reorg has sufficient workspace to complete its execution, but there is no reason why a DBA has to provide this judgment. Rather, the reorg or a utility monitor could easily supply it.

As trivial as it may seem, IS organizations often do not take server automation to its logical conclusion. I can only speculate why. Most IS organizations do not yet think in terms of total automation. Instead, they unconsciously accept limits on the possible based on past experience. Both the urgency of the problem (due to server proliferation) and the universe of possible solutions (due to advancing technology) have moved well beyond the limits of "data center automation," which were accepted as givens just a few years ago.

This leads to the second great principle of server automation: If an agent or monitor MUST notify a human of a problem condition, it should do so in a manner most appropriate to the human. You can see how important this principle is by looking at how many IS organizations handle notification today -- not in ways most convenient to human needs, but instead reflecting the technological limitations of times past. For example, suppose a Job Scheduler detects that a program fails in the nightly schedule. It notifies an operator by displaying a message on the console. The operator then pages an on-call programmer to support the failed application program.

Is this the most useful approach? The Job Scheduler should either handle the job rescheduling process or restart. If it cannot (thereby violating the first principle), it should, at the very least, do a better job of notifying IS. Displaying a message on the system console requires an operator to look at the console. There may not be a way to verify that the operator has read the message and/or acted upon it. A better approach is for the Job Scheduler to contact the support programmer directly, without the intermediate step of operator intervention. How should it notify the programmer? A direct call to a pager or a telephone call are a couple of ways. The bottom line is to design notification systems for human convenience, rather than basing them on past practices or on technology limitations that no longer apply.

The third principle of server automation is location transparency. Location transparency means that any administrative or automated function operates on a remote database server in exactly the same manner as if that server were local. With increasing numbers of geographically distributed database servers, the need for location transparency is obvious. Only through location transparency can centralized staff or small staffs manage and administer remote database servers. Although location transparency is the ideal (and is a requirement for many large companies), the reality falls short. Not all operating systems have strong remote management capabilities, and not all DBMSs offer good distributed database features. The tools for remote administration are still evolving.

The fourth principle is to limit heterogeneity. Automation is much more feasible when dealing with three -- rather than seven -- operating systems. Similarly, database automation works better with one -- rather than four -- DBMS products. And network administration is easier with one -- rather than three -- protocols. Automation requires programming that is aware of (and dependent on) the software that it controls. It may be possible to automate on top of all products of a given type, but it is not practical. For example, an intelligent reorg monitor must have some knowledge of the database product it controls. It's much easier to program this monitor to handle DB2 for Unix database servers than it is to program the utility to handle DB2 plus Sybase SQL Server for HP-UX, Oracle for OS/2, and Gupta SQLBase for NetWare.

The final principle revolves around the concept of components. Buy all tools possible, then program on top of them as little as possible. Ideally, components integrate with each other, or they at least provide a common interface for user programs. Modularize your own code by component or interface. Do everything possible to buffer your own code from changes. You must think like a vendor: "How do we generalize our system to the maximum extent possible for maximized reuse, and still limit our maintenance exposure?" And you must think like a user: "Let's buy everything possible, then do the minimal programming required to customize those prebuilt solutions to our environment."

Buy as many comprehensive components possible and build as little as possible on top of them. Given the primitive nature of server administration tools today, you'll still have to write a lot of code to automate your database servers completely.

Real-World Examples

Examples of database server automation follow. They are all real-world systems from my experiences. The examples do not address all possible problems, but they illustrate practical application of the principles I outline in this article.

Case 1: Systems Availability. A department has a dozen servers supporting its operations. Some are database gateway servers, which provide data movement to and from corporate mainframes. Others are application servers. At the heart of this system are four database servers that support the DBMS functions of the desktop-based client programs.

Server availability is of utmost importance in this system. Expected hours of availability are from 1:30 a.m. to 6:30 p.m., six days per week. The systems must experience no downtime during the 17 X 6 window. One way to achieve this expected availability is to rely on humans to oversee the operation. Data center operators can watch over the systems and ensure availability. However, this approach imposes a time lag before technical support staff can be informed of a problem. Once a problem is recognized, more time would be spent during manual "problem identification" and "resolution." Automation promises greater availability and quicker response to downtime.

There are two aspects to the automated solution. I call the first aspect, which addresses hardware failure, "hardware fault tolerance." You can achieve hardware fault tolerance through any number of the hardware and software solutions provided by equipment vendors. These solutions include high-availability clustering that fails over to a secondary host in the event of a primary machine's failure, as well as various forms of RAID, DASD mirroring, attaching DASD to multiple adapters and hosts, and so on.

My focus in this example is on a separate aspect of the problem: ensuring availability of software systems; that is, the operating systems on the machines (such as IBM's AIX Unix), the DBMS (such as Sybase System 10), the database gateways (such as MicroDecisionware's DB2 Gateway, now owned by Sybase Inc.), the Job Scheduler (such as QMaster from GD Associates Ltd.), and SNA connections. These software systems could fail independently of the hardware. How can you ensure reliability?

The project team in this case study wrote a generic systems availability program. (See Figure 1.) This monitor verifies that the software systems running on a particular machine are up and working. If not, it restarts them. The monitor cross-checks other machines by running the same system availability routines on them remotely. If any of the remote software systems are unavailable, the monitor restarts them too. The systems availability monitor runs on all key machines in the complex. A simple scheduling facility (in this case, the Unix crontab facility) executes the monitor at predefined intervals. Shorter intervals reduce mean time to restart software in case of failure, but this approach also consumes more overhead for the monitor "daemon." This site runs the program on selected machines (which I explain in a moment) every 10 minutes during the availability window.

The value of the monitor is that it actively rectifies any availability problems it uncovers. It does not require human intervention unless the software system at fault cannot be restarted via program intervention. The program logs its data analysis and decision-making activities so that system administrators (SAs) can view them later. In 18 months of operation, the monitor has successfully ensured that software systems were up in all cases, exclusive of hardware failure.

What if the availability monitor identifies an unavailable software system but cannot restart it? This brings up the issue of how to notify technical-support personnel. The goal is to notify SAs quickly for quick response. The notification vehicle must be convenient and immediately "at hand," from the SA viewpoint. Writing a message to the systems console is inadequate because these servers are part of a totally automated environment -- there is no operator at the console to read a message. Ditto for printing a message or report -- they may not be seen or picked up for hours.

Sending an email to the support team is another option, but it assumes that someone is logged into the email system and scanning for messages. You could set up the email system to notify (beep) when a critical message arrives, which is a good alternative for other situations, but, in this case, the 17 X 6 availability window implies that there are times when no one will be logged onto email to respond.

Another alternative is to deliver a message via pager. Pagers are small and light and can be worn on clothing. Various software packages are available for Unix, OS/2, and other operating systems that help user programs send textual messages through pagers. The user program typically interfaces to the pager program via an API. The program can beep the pager and send a descriptive character string of up to a couple hundred characters on the pager. An intelligent notification system would use the descriptive string to specify the exact nature of the problem. It should state the time, the machine and software system experiencing the problem, and a problem description.

In this project, however, the team opted for a voice message delivered by telephone. Machine-synthesized voice messages, which can be received via any standard or mobile telephone, describe the problem, support a rotating "call list" of support personnel, and are responded to by pressing buttons on any touch-tone phone. To support this infrastructure, a speech-synthesis system called DECTalk (Digital Equipment Corp.) was purchased to place the phone calls. (See Figure 2.) DECTalk is a special-purpose computer that attaches to the RS-232 port on a host machine. Control of the DECTalk box is achieved by sending control characters and character strings across the RS-232 line. DEC provides an API that makes it easy to control DECTalk, or you can program it yourself at a lower level, via any programming language capable of sending character strings through the host machine port.

In this project, programmers wrote a short Fortran program to the DECTalk API. This program can be invoked by any other program residing on any host in the server complex. Input parameters to the Fortran program include a list of phone numbers to call (in their preferred calling order), and a character string with the message to articulate to the recipient. The Fortran program calls the numbers on the call list, one after the next, until someone answers his/her phone and accepts responsibility for the call by pressing "*" and "1". The recipient hears the message via the speech-synthesis capabilities of the DECTalk computer.

The Fortran program ensures that someone accepts responsibility for the situation by requiring a valid response of "*" and "1". In the absence of this response, DECTalk calls the next person on the call list. This avoids the problems (such as answering machines) that some voice-call systems encounter. A key feature of DECTalk is this ability to recognize feedback from the call recipient in the form of touch-tones. The Fortran program, for example, could be enhanced into an entire system of menus, with actions dependent on recipient responses. It is possible to achieve a level of automation such that the recipient could run programs or perform other system actions merely by pressing touch-tones in response to the calling program's prompts. The team did not take advantage of this approach in this project because it was considered overkill for the particular problem.

Case 2: A Simple Performance Monitor. Operating system and database performance monitors are complicated pieces of software. You don't want to create them unless you have to. My motto is: Buy, don't build. In this particular case, however, a client had a special situation that was not addressed by the existing DBMS tools. Several different activities would occur on the production database server during the day that adversely impacted response time for otherwise quick production queries. One problem was that developers would perform unauthorized work on the database server because the server was not locked down. Another issue was that developers or users would run programs that resulted in mass loads during prime time. Yet another problem was a bug in the DBMS itself. Under certain conditions, the DBMS would flip into a high CPU state, thrash, and perform little useful work. Finally, there were several other conditions under which the server would end up consumed by low-priority work that was less critical than quick, simple user queries. Queries that required quick responses were often blocked by long-running programs.

To combat the problem, the server team wrote a simple monitor program called "Assassin" that periodically scanned for any of these conditions. Upon detecting any of these problems, it responded by canceling the offending thread. Using this approach, it was possible to keep the production server functioning smoothly. A key part of the Assassin program was that it logged the conditions it identified and corrected. This information was then condensed into a report that tracked the frequency and nature of system problems.

Based on the intelligence embodied in the report, the client was able to change its administrative procedures to address the root causes of the production server problem. Developers were given servers with databases that mirrored the production server, and they were therefore moved off the production machine. Mechanisms were also defined and enforced to prevent loads and other problem-causing programs from running during the day. The DBMS vendor corrected its exposure to the thrashing state. Thus, the Assassin program served adequately for a year in its role of server-policeman. More important, it identified the underlying causes that the client was then able to resolve directly.

The database monitor program embodies several key points about server automation. First, it is a daemon, which is a program that awakens and runs at periodic intervals. For example, the program might run every 15 minutes during critical activity periods. The concept of the daemon comes from the Unix world, where daemons can be easily scheduled via Unix facilities such as crontab. Most other multiuser operating systems have equivalent facilities for creating daemons. The daemon concept is fundamental to automating several server functions.

Second, the database monitor corrects the situation itself. It does not rely on human intervention to correct the problem conditions that it identifies. This concept underlies "total automation," which is achieved only via self-correcting systems. To set up self-maintaining servers, notification must be an afterthought, not a prerequisite to an immediate solution.

Although the Assassin program was not developed further, it is an example of the kind of server automation that can be achieved. For example, it embodies rudimentary artificial intelligence techniques in its decision-making processes. I hope that tools vendors will increasingly embed intelligence within their products using such techniques. There is no reason that DBA and SA expertise cannot be injected into intelligent server agents that can maintain those systems with little human intervention.

Case 3: High-Reliability Download. A site downloads DB2 mainframe data into a group of Unix-based database servers nightly. All updates are applied to the mainframe systems, and the nightly download moves the data to query-oriented database servers. The downloads must occur after all mainframe updates are complete, but before users reach their desks in the early morning. Therefore, the download window is from 1:30 a.m. to 7:00 a.m.

A key requirement of this system is 100 percent reliability of the download process. If the data on the servers is not current or not accessible at 7:00 a.m., the system has little value to the organization. A few such failures would render the system worthless in the eyes of its users.

Thus, the development team wrote a small set of programs to create a server-initiated download system that features high-reliability and guarantees notification of support personnel when a failure occurs. The underlying purchased components were MicroDecisionware's (now Sybase's) Database Gateway for DB2 and the QMaster Job Scheduler for Unix. The MDI Gateway extracts data from DB2 tables, performs any necessary data type translations, and places the data directly into tables on the target database servers. The QMaster Job Scheduler schedules and controls all nightly download jobs.

The download program itself is a locally written C program. This program drives the MDI gateway in batch mode, controlling it and ensuring successful completion of the download processes. The C program is data-driven, meaning that in each run, it executes all the download statements residing in one Unix directory. Many of these statements are MDI transfer statements; that is, they are the basic gateway statements that select data from DB2 tables and download it into tables on the target database servers. The data-driven nature of the download program means that it can target multiple database servers, run in parallel with other copies of itself, and support data uploads to the mainframe.

Various Korn and Rexx programs were developed along with the C program. These provide for the set up and control of the C download program, the MDI Gateway, and the QMaster Job Scheduler. One Rexx program encompasses the entire system and provides a single, menu-driven interface. This Jobmonitor program offers one point of control to operators, as well as to programmers who maintain the download directories and support personnel.

Key to reliability is the C program's ability to pick dynamically alternate paths to connect the mainframe and database servers. These paths duplicate the entire connection between the mainframe and database servers, with alternate routes via two MDI gateways, running on two different gateway servers, using two different controllers, with various connection IDs, and so on. The download program automates the process of path selection and ensures that if any working path is available, it is used.

If an error occurs, the C program collects diagnostics on the problem. Two mechanisms are used to guarantee notification of support personnel. Duplicate notification mechanisms fulfill the system's goal, which is to ensure that downloads are completed on time, at all costs. Operators monitor the Jobmonitor program, which provides a single panel that lists any callable error conditions. Second, the DECTalk voice-synthesis system described in the previous sections calls support personnel on the current call list and notifies them of any error. The system is intelligent enough to know who to call. If an error occurred in the data inputs driving a download, for example, the program calls the programming group responsible for that data. If the error is of a systems nature, the program calls the technical support team for the download system.

The system automatically handles most system-oriented problems via dynamic path selection for connection to the mainframe, and it can restart the database servers, gateways, and so forth. However, the system does not automatically restart a failed download job. Instead, it enables programmers to restart any failed download job via the Jobmonitor interface. Although job restart could have been automated, the complexity of doing so versus the infrequent occurrence of this problem argued for notifying support personnel instead.

In its first year of operation, the system has ensured that data has been delivered to the database servers on time. Because the system automatically handles server-related problems, calls for help are infrequent. And when calls are made, they almost always involve mainframe system availability (DB2, CICS, VTAM), which the download system cannot manage directly.

The download system took one person three months to develop. Little design time was required, because the programmers had run and overseen their downloads manually for a year, and they knew quite well what was required for system automation. The automated download system took about 10,000 lines of code. The C program was 2500 lines, the Jobmonitor was 3100 lines, and other supporting Korn shell and Rexx programs added up to 4500 lines of code.

Case 4: Intelligent Reorg. A final example of database server automation is a Basic Intelligent Reorganization utility, or BORG. BORG is based on a standard Reorg utility under OS/2 that supports DB2/2. It includes backup steps for the database and operating system, as well as database reorganization.

The intent of BORG was to go one step beyond how backups and reorgs are typically handled on small database servers. In the system-design process, support programmers were asked to write down the reasons for failure of the nightly backup and reorg programs. For the backups, failures were due to bad backup media, unavailability of the backup media, and so on. For reorgs, problems included lack of workspace conditions, insufficient workspace to start, improper scheduling at the same time as an incompatible program, and incorrect utility parameters due to changes.

For backups, BORG handles error conditions intelligently by providing alternate forms of backup. In this case, disk-to-disk backup is substituted for tape backup. For reorgs, BORG intelligently checks the database status (by comparing current catalog statistics to updated statistics), decides whether any Reorgs are necessary, and, if so, runs them. BORG oversees the reorgs and fixes the kinds of errors that occurred in the past. Logs trace BORG's decisions, and emails to the DBAs notify them the next morning of any important activity. BORG also handles any consequences of its actions, such as the need to re-bind static database access plans.

Unlike the high-availability systems of the previous examples, BORG is low-criticality. Its goal is simply to automate what can be automated, and to ensure that scarce personnel are not needlessly bothered over mundane problems with nightly backups or occasional reorgs. BORG does not handle all possible error conditions, and it is not a fault-tolerant process. Rather, it is an example of simple automation that fixes the most common problems.

The Future

In this article, I've concentrated on the practical; that is, I've outlined what IS can do today to automate database servers. What about the future? Several trends are evident. Vendor administration tools will become more powerful and location-independent. Network control, database administration, and operating system maintenance tools will integrate more efficiently. Agents and intelligent agents will proliferate. Standards for interagent communication will become widely accepted.

In addition, remote database servers might evolve into an IS-provided utility, much like file and print servers have. In the early and mid-1980s, file and print servers were often installed and maintained by user departments outside of IS. Over time, IS stepped in, took responsibility for these diverse LANs, and consolidated and standardized them. Today LANs are company-wide utilities that are taken for granted by many users, while in the background, IS installs, administers, and maintains them.

Database servers might follow the same path. Many client/server projects today are departmental undertakings. With the evolution and integration of tools for remote administration, IS might support a "database server utility" in the future. In this scenario, central IS services would install, set up, and maintain the hardware, operating system, DBMSs, and communications components in order to supply a basic database server to user departments. Program development, customization, and software package installation would be performed by the local project team in charge of each database server.

Whether database servers are consolidated or not, tools for remote administration will continue to improve. But customization and programming for automation will still be necessary. Only IS knows the specific requirements and topology of the corporation it supports. Hopefully, the experiences in this article will give you ideas as you tackle your own server automation projects.


Howard Fosdick is an independent consultant who works as a client/server architect, DBA, and systems administrator. You can reach him at Fosdick Consulting Inc. via the Internet at 75403.1772@compuserve.com.


* GD Associates Ltd., 160 Bayview Dr., SW, Calgary, Alberta, Canada T2V 3N8; 403-264-8322 or fax 403-265-5307.
* Digital Equipment Corp., 2 Result Way, Marlborough, MA 01752; 508-493-5111 or fax 508-493-8780.
* Sybase Inc., 6475 Christie Ave., Emeryville, CA 94608; 510-922-3500 or fax 510-922-9441.


FIGURE 1


--The systems availability-monitoring program verifies that the software programs are up and running. If not, it restarts them. It also cross-checks other machines by running the same systems availability routines remotely. If any of the remote software systems are unavailable, the monitor restarts them too.


FIGURE 2


--Any program in the server complex can invoke a Fortran program that interfaces with the DECTalk API. The Fortran program calls the numbers listed on the call list until someone answers his/her phone.


TABLE 1
Five Server Automation Principles

1. Programs should fix problems
2. Use appropriate notification
3. Achieve location transparency
4. Limit heterogeneity
5. Buy and customize components; don't build



Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
February 1996 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Monday, November 11, 1996.