DBMS

Design and Replication Issues with Mobile Applications, Part 1

By Glenn Froemming
DBMS, March 1996

Exploring the Complex Replication Issues Inherent in Mobile, Sales Force Applications.


At the end of 1995, it seemed that just about every major RDBMS vendor had announced or shipped a form of replication server capability. The failure of two-phase-commit, and the network bandwidth and availability limitations in real-world distributed, cl ient/server systems has propelled replication technology to the top of the feature list for distributed database applications. I work for Sales Technologies, a company that has been providing mobile, replication-based, sales force automation (SFA) system s to our customers for more than 10 years.

Until now, off-the-shelf replication services have not been available, so companies like mine develop proprietary solutions that are specific to particular market segments. To enable us to address broader markets, we have rearchitected our latest generat ion of replication technologies to meet new requirements. Our new software, which we call the Open Data Delivery System, addresses the requirements of heterogeneity (that is, the connectability to a variety of RDBMSs), and what I call heteromorphism (the ability to identify and address data elements under different names and shapes in different operating environments). As a designer of replication technologies, I take great interest in the newest replication services being offered by almost every major RDBMS vendor.

Most of the current wave of vendor-supplied replication services are targeted for data warehouses, capacity relief, hot backup, and client/server performance. This article explores the issues inherent in replicated applications, and, more specifically, i dentifies those issues inherent in mobile applications. Due to the high node count, mobile replication is a difficult problem that is not addressed by available products.

Background

SFA systems are typically provided to the field user via laptops or pen devices. For the highest level of mobility and availability, the systems are designed as a highly replicated, distributed, disconnected application. Every day or two, the field user calls a central server site to synchronize with the server database and to exchange a variety of other information with the server system and other users via email, electronic reports, and remote configuration management files.

Our SFA systems have integrated with RDBMS systems since 1990. SFAs are used to support mobile sales forces of anywhere from 100 to more than 1500 field units. The field units utilize a different RDBMS than the server environment, and our server componen ts must be able to operate against any of the popular server RDBMSs. The system also provides facilities to exchange data with other corporate data feeds and third-party data providers.

While building and supporting these systems, we have identified a long list of interesting peculiarities of highly replicated systems. Many of these peculiarities are very subtle. In this article, I discuss a few of the more prominent ones, in the hope t hat the information will help you when you are evaluating currently available replication technology. I also provide some guideposts to be aware of in the design of replicated systems, and suggest a few questions to ask your replication services vendor b efore signing up for the program. (The terms I use in this article are listed and defined in Table 1.)

The good news about replicated systems is that they can provide the solution to performance and availability problems. The bad news is that replicated systems are difficult to build and may be even more difficult to manage. Before I look at these issues, however, I'll take a quick look at the various forms of replication.

Forms of Replication

Replication can be provided in numerous forms and combinations. For instance, there are simple, single-updater systems and multiple-updater systems; the latter are also referred to as bidirectional or symmetric systems. Data can be replicated at the disc rete table-row level or the transaction level. Replication can also be provided through a single distribution hub, or by forwarding through several intermediate distribution points to the final destination sites. Realistically, it is doubtful that the propagation delay and risk would be tolerable through more than a very small number of sites.

Single-Updater Systems. If you have a system where only one user has rights to update the data or one group whose updates can be serialized through one data source, your replication problems will be fairly easy to solve. You will have to deal with the remote configuration management issues involved with distributed systems, but several vendors are beginning to provide tools to that end, including Microsoft SMS, Oracle BattleStar, Stream, and RemoteWare Xcellenet.

In addition, the replication is performed through read-only copies or "snapshots," and you don't have to worry about data collisions (the replication version of concurrency management) or sequence generator conflicts. You also don't have to worry if all of your remote nodes have synchronized in order to reorganize the server, or any of the other issues I describe in the next few sections. If you can confine your application to a single-updater model, more power to you. It will be a much simpler system t o build and maintain.

One issue you may not escape, however, is that of the heterogeneous nature of today's database application world. It may be that you have a mainframe DB2 environment in one location, Oracle/Unix at another node, Sybase/VMS at another, and so on. And whil e the replication services provided by many of the RDBMS vendors claim heterogeneous capability via gateways, this approach has performance implications, adding yet another level of performance degradation at the source end of the replication process. In particular, if one node must replicate to a large number of heterogeneous remote nodes, the performance implications can become particularly important.

Multiple-Updater Systems. If any of the replicated tables must be updated at more than one site, things get much more complicated. The most obvious issue is data collision. This is a variation of the concurrency problem typical in OLTP database sy stems. There is no locking mechanism available in asynchronous replicated systems -- users at two or more autonomous sites can be updating the same data at about the same time. At replication time, the replication server must provide mechanisms to detect and resolve such data collisions.

A more subtle part of the data collision problem has to do with what happens after a collision occurs. Should all replication to that node be suspended until the problem is resolved? Where will the problem be reported? Who will resolve the problem? What tools are available to correct the problem? What is the impact on the application in the meantime?

In part two of this series, I will discuss some forms of collision detection and resolution mechanisms that various vendors already provide. Keep in mind, however, that you must account for the types of mechanisms available and design within or around th ose capabilities.

Other Considerations

Database operations can be replicated as discrete table operations or as complete transactions. Complete transactions are preferable, but some replication servers do not track changes at the transaction level and are thus limited to table-level replicati on. This limitation can lead to serious support issues when some of the transaction operations are successfully replicated while others fail, leaving your database in an inconsistent -- and incorrect -- state.

There are several things to review when considering a replication approach for an application. Some applications simply are not viable candidates for replication. Frankly, if you can avoid a replicated approach, your life will be much simpler, but replic ation does provide a solution for several of the performance, availability, and mobility requirements of today's applications. Some considerations, such as data timeliness, are go/no-go level considerations. Others have a potentially significant impact o n how the application must be designed to operate within the constraints of the replication services. I discuss some of the more important issues in the next few sections.

Data Timeliness. Perhaps the first thing to think about when considering replication-oriented designs is the impact of data timeliness; that is, the age of the data at various sites along the application network. Applications that require up-to-th e-second data are not candidates for replication-based designs because the replication process takes time. Even if replication is "continuous," the data is going to be, at the very least, several seconds old, or, more likely, several minutes old, dependi ng on how many sites are being updated. Applications that can tolerate data that is anywhere from a few hours to a day old are much better candidates for replication.

Richness of the Table Partitioning Mechanism. Another important consideration when designing replicated applications is the richness of the replication service's distribution function. Most replication services provide some level of horizontal and vertical table partitioning. Horizontal partitioning involves restricting the rows being copied. Vertical partitioning determines which columns will be replicated to a specific replication target. Most of the currently available partitioning mechanisms are limited to operating within the confines of the replicated table itself; in other words, the partitioning must be defined in terms of a query that operates exclusively on the replicated table itself. This intra-table distribution level can be limitin g, and can have implications on your database design. For instance, intra-table limitation can force you to carry ownership information (primary keys of parent records) further down into the lower- level nodes of your database design than you would typic ally care to do.

A partitioning scheme that lets you join with other database tables provides much more flexibility and will not impose these subtle, but costly (in terms of database size and maintenance), database design issues on you. For instance, a mechanism that let s you plug values from one or more columns in the replicated table into a parameterized query that can reference any number of tables would be very powerful. Even more resilient mechanisms would include the notion of defining partitioning through a user- supplied function written in languages such as C, C++, and Visual Basic.

Another, broader issue in the data distribution or replication rules domain involves the overall orientation of the replication process itself. In many available replication services, replication is defined in terms of a set of replication "rules," which are specified for each table for each site. As each subscriber site connects, the set of replication rules stored for that node is run against all transactions that may be replication candidates for that node. If the transaction ends up being replicated to node A, some trace of the event must be left so that after all sites have picked up the transaction, it can be deleted from the replication log. Thus, replicated transactions cannot be purged from the replication log until all sites designated for sp ecific transactions have picked the transactions up. Plus, some replication tracking information must be updated each time a transaction is replicated. In the meantime, each time a subscriber connects, its replication rules will operate against a table c ontaining transactions for possibly hundreds of other subscribers.

In multiple-node applications, merely specifying and maintaining a set of rules for each remote user is a chore. Therefore, a mechanism should be in place that lets you reuse replication rules. Making a pass over the replication tables for each subscribe r and updating reference count information as the same transaction is replicated to each subscriber is additional repetitive overhead.

This large volume of rules could lead you to question the general orientation of the replication process. Is it really necessary to specify a separate set of rules for each node? Could you specify replication along table or transactional lines? In the SF A application, we have a large number of nodes to which data is replicated in a parametrically similar manner. That is, replication is frequently completed by region or group, but it is parameterized, rather than a hard-coded value such as:

zipcode = 30326
or
territory = 'BOB'
We find it convenient to specify replication along business object (frequently equivalent to table) lines. It is usually sufficient to specify one or two fairly powerful rules per business object.

In addition, it is much more efficient to implement more of a queued push-type architecture at the main publisher node. The server distribution process is run on a customer-specified interval. When it runs, each transaction in the replication log is revi ewed and copied into a transaction file for each node. The replication log can then be purged, and remote users simply pick up their transactions when they choose to call in. No server database cycles are consumed.

The Performance Impact

Invariably, adding replication to an application imposes some level of performance impact at every site involved in the replication process. Replication is typically provided using:
  1. transaction log "sniffing" or "scraping" (the change is determined by picking through the database transaction log); or
  2. database triggers and procedures record changes, or some other mechanism that captures DML statements and records those changes in a change file.
Of these two approaches, the database trigger and capture mechanism approach has more of an impact on real-time application performance, because database journaling is usually turned on anyway for any application with recovery requirements. The performan ce cost in the log-scraping architecture comes when the log must be scanned for replicatable changes.

When considering the log-scraping model, you should ask the replication vendors some key questions: Is the scraping performed at some periodic interval or when a target calls in to synch up? When are replicatable changes removed from the log file? Does t he log scraping impact when the log can be purged/reused?

Another less obvious impact on performance comes whenever a replication node connects to pick up and process changes. Typically, some sort of query must be performed to identify those changes applicable to the calling node. This consumes some amount of p rocessing cycles. Then, each change must be applied to the target site, which consumes even more cycles. If you are replicating a high number of sites, the performance hit could be very significant.

Heterogeneous Environments

As I mentioned earlier, you must account for the heterogeneity factor in today's computing world. The fundamental need in replication server environments is for some sort of standard protocol to transmit change transactions among systems. These transacti ons are typically relational database changes, so how about SQL? Well, we all know how standard that is. Therefore, ODBC might be a possibility for dealing with non-relational DBMSs and even file systems (albeit not very efficiently). What about OODBMSs, three-tier architectures, data independence, and all that? Actually, I believe that business objects can help to absorb some of these heterogeneity issues (I'm referring to generic business objects, not the company or product of the same name.) Our busi ness objects are defined at an abstraction level above the physical database schema, which enables us to map different schemas within the same business object model.

As I mentioned before, some vendors support heterogeneity through the use of gateways. However, gateways depend on the availability of the other machine at synchronization time, and I don't really see how such a scheme can be bidirectional. You can imagi ne how to implement the gateway approach in a snapshot scenario, but if there is an Oracle7 engine on the publisher side, and Sybase on the subscriber side, how does the Sybase side know how to send replication updates to the Oracle side?

Once you consider the heterogeneous world, you must also consider the heteromorphic world. Can you really expect the database schemas at all of the various replication sites to always be the same? Don't count on it. If you're lucky, you won't have any ta ble overlap, but you will have to deal with different table or column names. Or the units data type at one site may be different than the units data type at another site. Or data in one table at one site may be spread across two or more tables at another site. Once the data starts to span tables, bidirectional replication begins moving from difficult to impossible. Updates across multiple table are tough. Cross-table inserts and deletes are virtually impossible.

This is the old updatable view problem. If the tables involved in the replication are associated with only one view, you can solve the problem rather easily. But if the tables are associated with multiple views, or can stand independently in other contex ts, would it be correct to delete from ViewA all associated table entries? On the insert side, is it possible that one of the tables involved in the view could be instantiated independently? In that case, an insert on ViewA that gets broken down i nto multiple table inserts might fail on the insert into TableX, because the TableX component already exists. Should it be replaced or updated?

Some replication vendors provide mechanisms to deal with the heteromorphism issue. Sometimes, however, the mechanism is in the form of a third-party tool that intercepts and morphs the transaction at one of the exchange points, at the sender side, or at the receiver side. These strategies usually have their own limitations, however, and you should question the vendor fully before going with one of these solutions.

Next month, I will discuss more of the issues involved with managing mobile, replicated applications, including cross-platform complex transaction semantics (cascaded delete), data collision detection and handling mechanisms, and error notification and c orrection.

Issues Specific to Mobile Apps

In the course of describing the various issues involved with replicated applications, I have alluded to additional issues in mobile application environments. I'll discuss the major ones now.

The Communications Factor. Communications are at the heart of all mobile applications. Over the years we have found that by minimizing communications, we can provide higher performance, higher availability, and lower-cost systems. The reason is th at with modem-based communications come all those ugly little communications issues, including the modems themselves (configuration, installation, maintenance, and so on), cables, plugs, phone jacks, baud rates, line quality, line drop, and power fluctua tions.

Of course, the big thing now is wireless communication, which eliminates all of that cabling and manual intervention. But how well does your cellular phone work when you're in the middle of an average-size office building, perhaps near a big support beam ? Can you really imagine trying to conduct serious business with your data on that line quality?

Unfortunately, wireless simply does not provide the bandwidth needed for SFA applications. These applications involve tables that have several hundred thousand rows, of which a few thousand may be selected for a scrolling list box query. Delivering that much data at current transmission rates isn't practical. As this technology evolves, the transmission rates improve, compression gets denser, and client-side caching gets better, wireless technology may be viable for data-intensive applications such as S FA.

Communication costs can also be a significant factor. Even just maintaining a direct database-to-database connection during the database synchronization process can be costlier than you might suspect. First, even if the synchronization time is relatively short, when you scale it up by 1500 users per day, five days a week, it starts adding up. And, unfortunately, in the direct database-to-database connect synchronization model, there's a fairly good chance that the connect time may not be short, because that model directly depends on the availability of database server cycles at the publisher and subscriber sites. If one publisher site is servicing 1500 users (a significant portion of whom may connect at about the same time), you will be faced with a se rious capacity problem. If the capacity is not provided, machines are going to be hanging on the phone line for a long time, running up phone costs and increasing the probability of problems such as line drop, line quality loss, and so on. These p roblems can be exacerbated in the international market, where communication costs are typically higher and line quality is typically lower.

More than five years ago, we made a very conscious decision to move from a peer-to-peer, direct-connect synchronization type architecture to a more asynchronous, batch-oriented, post-communication architecture for many of the reasons I've cited. While th is approach introduces complexity into the replication service design, the market simply will not tolerate the long connect times and high communication costs that can arise when scaling the direct connect model up to 1500 users. The batch-oriented, post -communication design enables the server side to process transaction files as cycles become available, without requiring the subscriber to hang on the line. Transaction files are processed out of a chronologically ordered queue, where mechanisms are prov ided to ensure proper serialization of each user's transactions. The processing for 1500 users can be spread across a 12-hour to 18-hour time period at the server's discretion, rather than at the subscribers' demand.

The End-User Factor. When applying replication technology to mobile applications, you must account for the fact that the operator of the machine is an end user. There are no system administrators, DBAs, or operations personnel out there. These are end users who are not going to correct errors or perform backups. They probably won't even report errors unless the error is physically preventing them from doing their work. They are not going to reorganize anything. They are not going to purge anythin g, at least not in a way that would give you a warm, fuzzy feeling. However, they are going to be on the phone in a heartbeat if you lose or otherwise corrupt their data.

The impact of this factor on application design and replication vendor decisions is significant. At the application level, you must guarantee robust fault tolerance and restartability in both the application and the replication service. In some cases, it might be wise to lock the user out of the application in the event of certain kinds of failures. And, as I've already mentioned, you must determine -- in advance -- how replication errors will be reported and corrected.

A more subtle impact of the end-user issue is that you must design for the fact that you cannot force a user to connect at preset intervals. You can try to explain all the problems inherent in infrequent communication, but, in the end, you have no contro l over it and must account for it in your design. This has particular impact on any time-based collision detection mechanisms.

You must also consider how you will change, update, and maintain your remote applications. If the change or update is relatively minor, you might want to perform the change remotely. Companies don't like interrupting the activities of their mobile sales force, and asking them to send in machines for an upgrade is out of the question. Given the ability to send configuration changes to the field, your replication service and the application should be able to detect a configuration change and ensure that a ll the parties are operating at the same configuration level.

Conclusions

As with most new technologies, replication is being touted as the solution to a variety of computer-related headaches. And as with most new technologies, if you fully understand the technology and can match the technology to the application, it may provi de some needed relief. However, you must be aware that, by definition, replication involves dealing with several different databases at different locations across networks and communication lines, so there are many opportunities for problems and failures . Therefore, you must understand replication and the capabilities and limitations of specific vendor's replication services to make wise design choices.

Some vendors would have you believe that applying replication capabilities will, like a Harlequin romance novel, lead fairly painlessly to that happy ending you are seeking. While I believe it's more likely to be along the lines of a John Grisham thrille r, it doesn't have to be a Stephen King book.


Glenn Froemming has been in the computer industry for over 20 years. For the past eight years he has served primarily as a system architect, designer, and technical development manager. He currently works for Sales Technologies Inc. (Atlanta), a subsidia ry of Dun & Bradstreet, as chief architect and technical development manager for the company's Open Data Delivery System. You can reach Glenn via the Internet at 103147.374@compuserve.com.


TABLE 1. Replication Terminology

Asynchronous: In the context of replication, this is the ability to log SQL operations targeted for another site for later processing by that site. Note that this does NOT necessarily imply that the actual processing of the transactions will be as ynchronous when the subscriber finally connects to receive the changes.

Bidirectional: In this form of replication, both the publishing (originator) and subscribing (receiving) site can update a particular data object. This is also referred to as "symmetric."

Heterogeneous: This term implies replicating data among different DBMSs.

Heteromorphic: This term describes the ability to address and identify the same data element by different names (mapping) or in a different shape (different units, scale, display format, and so on).

Horizontal partitioning: This is a form of data distribution in which a subset of the replication data for a table is selected through the use of a replication query that restricts the set of rows that are replicated to a specific site.

Publisher: This is the originator of a replicated database change. Note that a publisher may also be a subscriber in bidirectional schemes.

Push replication: A publisher site controls when replication is to occur and "pushes" the changes out to the subscribers.

Pull replication: The subscriber sites determine when they wish to receive replication transactions.

Refresh: This is a process whereby one or more tables at one of the subscriber sites are completely restored or updated.

Snapshot: This is a single-updater form of replication where only the publishing site can update the data.

Subscriber: This is a receiver of a replicated database change. Note that a subscriber may also be a publisher in bidirectional schemes.

Symmetric: See bidirectional.

Vertical partitioning: This is a form of data distribution in which a subset of the replication data for a table is selected through the use of a replication query that will select only the columns of interest for a specific site.


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
March 1996 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, November 6, 1996