
Everything you need to know about managing a Web site for the long haul.
Most Web sites, for better or worse, have been set up quickly in response to some new customer demand. For Intranets, it's your own employees demanding easy access and a uniform interface to all of your internal systems, their health insurance information, their 401(k) information, and the list goes on. For Extranets, it's your suppliers demanding easy access and a uniform interface to supplies-ordering and shipping systems. And for the Internet, it's all of your customers and investors clamoring for annual reports, product ordering, and services online.
If you've recently deployed a Web site, you know the critical sense of urgency, the frustrations, and now, hopefully, the satisfaction of getting the job done fast. Unfortunately, the less-glamorous task of managing that Web site has just begun. If you haven't yet taken the time to establish organization, process, and technology strategies for the ongoing support of your site, you will quickly find that setting up the Web site was the easy part.
Now is the time to take a step back and survey what you've put in place. Can your organization, processes, and technology handle your current requirements? Can they handle the requirements you anticipate for the next six to 12 months? Now take a deep breath . . . Could they handle 10 times the usage 12 months from now? Could they handle a doubling overnight? Could they handle new media, new programming languages, new databases, and the pressure to innovate overnight when the hottest new tools come along? Internet technologies and the Web, by showing everyone with a connection what everyone else is doing, have made these scenarios much more likely, even on a humble 10-person Intranet. Even if you intend to limit the increase in traffic on your Web site, it would be much more pleasant for everyone involved if you didn't let limitations in your organization, process, or technology strategies do it for you.
This is a technical magazine, so I will concentrate here on the technical aspects of Web site construction, operations, and planning, although organization and process are at least as important in keeping a Web site going. As you'll see in this article, Web sites are in many ways just like any other software technology, and like database servers, transaction-processing (TP) monitors, and so on, they must be integrated into the fabric of your organization to be used effectively. Where managing Web sites differs from managing traditional client/server systems is in accessibility -- just about everyone has a TCP/IP connection and a browser and can write HTML. This implies that more people will want to create content and applications, and more people will want to see and use the content and applications that are available. As a result, the biggest issue in managing a Web site well lies in managing scalability.
Many successful sites increase their traffic tenfold over a year, and a single new application or announcement can literally double the traffic overnight. Whether your Web sites are used to support Intranets, Extranets, the entire Internet, or all three, and whether you're running a single Web server or 100, being able to scale in step with growth of such magnitude will make the difference between success and costly, aggravating failure.
On top of the physical environment, the first layer of the software environment is the operating system and its networking components, which include networking hardware and firewalls. Here you need to consider strategies for scaling the performance and reliability of your Web site. Some hardware and operating system solutions will lead you to move toward fewer larger machines with high-reliability disk arrays, and other solutions will lead to several smaller, redundant machines. Both directions have their proponents, but most sites will opt for a combination -- larger machines for more complex applications, and smaller machines for either niche applications or as interchangeable front-end processors for the larger machines.
There is no single standard solution for building high-reliability, load-balanced Web sites. The closest thing to a standard solution available is round-robin Domain Name System (DNS). This maps a single hostname to multiple IP addresses in your DNS configuration; as requests come in, the name server returns each of the IP addresses in turn. Although this approach helps spread out the load across multiple machines, it can't route traffic to the machine with the lowest load, nor can it automatically stop routing to a machine that's down, and it adds to the overhead of DNS by requiring a low time-to-live value on the hostname. You can address some of the shortcomings of the round-robin DNS approach with Roland Schemers's load-balancing name server daemon, lbnamed (www-leland.stanford.edu/~schemers/docs/lbnamed/lbnamed.html). Others prefer to go down a level to manipulate routing protocols -- see Lucent Technologies Inc.'s WWW6 paper, at www6.nttlabs.com/HyperNews/get/PAPER196.html -- or to simply purchase a proprietary solution from Cisco Systems Inc. (LocalDirector), IBM Corp. (Interactive Network Dispatcher), or others.
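As a rough illustration of the round-robin behavior described above, the following Python sketch simulates a name server that returns each configured address in turn. It is a toy model, not a real DNS server; the class name, hostname, and addresses are all hypothetical.

```python
from itertools import cycle


class RoundRobinResolver:
    """Toy model of round-robin DNS: one hostname, several addresses."""

    def __init__(self, records):
        # records maps a hostname to the list of IP addresses configured for it.
        self._cycles = {host: cycle(addrs) for host, addrs in records.items()}

    def resolve(self, hostname):
        # Return the next address in turn, with no regard to load or health.
        return next(self._cycles[hostname])


# Hypothetical hostname and addresses (192.0.2.0/24 is reserved for examples).
resolver = RoundRobinResolver(
    {"www.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
)
answers = [resolver.resolve("www.example.com") for _ in range(6)]
print(answers)
```

Notice that the resolver would keep handing out 192.0.2.2 even if that machine were down or overloaded, which is the shortcoming that tools such as lbnamed try to address.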
Your choice of operating systems and supporting servers (databases, ftp, mail, news, chat, and so on) raises similar challenges. Each choice at this level may lead to completely different classes of solutions for your applications. Your best bet is to research the market on the Web and in print and then evaluate the contenders yourself in your own environment. Good load-testing software is indispensable here. At the low end, you can use your own scripts wrapped around freeware such as Silicon Graphics Inc.'s WebStone (www.sgi.com/Products/WebFORCE/WebStone). At the high end, many of the providers of client/server load-testing tools have adapted their software to test Web servers (for example, Mercury Interactive Corp.'s Astra SiteTest and PureAtria's Performix).
For Web servers, public domain servers (Apache) and servers based on their code (Stronghold) still lead in market share for the simple reason that they are as good as or better than most of the commercial products available. Also, in a market that is developing as quickly as the Web, support for the latest standards and innovations usually comes first to servers in the public domain. The leading commercial servers from Netscape offer what you could call a one-size-fits-all, please-use-our-proprietary-extensions option. Microsoft's offering is much the same, though limited to Windows NT. More specialized servers such as OpenMarket Inc.'s Secure WebServer can often be more flexible for tuning to your specific application -- and because of their smaller market share, these servers are more likely to use more standard, less confining extensions. For example, Fast-CGI, which enables you to run CGI programs without the overhead of starting up and shutting down processes for each request, has been adopted by several of the smaller vendors. You'll probably get better customer service, too.
As soon as you have more than a handful of content developers, you'll need some form of version-control software to keep people from clobbering each other's changes. In many cases, a simple version-control tool coupled with some link-checking software for your HTML files will be enough. If you take this route, be sure that whatever tool you choose (public domain RCS and CVS, Intersolv Inc.'s PVCS, or Microsoft's SourceSafe) is easy for your content developers to use. Graphic artists and writers accustomed to Macintosh and Windows tools will balk at a Unix command-line interface. The natural interface for this application is, of course, a Web interface, and many solutions are beginning to use this. Commercial link-checking software (such as NetCarta, now owned by Microsoft) not only verifies that your links point to valid destinations but also usually gives you a bird's-eye view of all of your site content. That view makes the overall structure of your site easier to see and may point out weaknesses in its layout or design that are difficult to spot when you view your site one file at a time.
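To make the link-checking idea concrete, here is a minimal Python sketch that scans a set of pages for links pointing at pages that don't exist. It is an illustration only, not how any commercial product works: the function and class names are hypothetical, the site is represented as an in-memory dictionary, and only simple site-relative links are checked.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_broken_links(pages):
    """pages maps a site-relative path to its HTML text.

    Returns (source_page, target) pairs whose target is not a known page.
    """
    broken = []
    for path, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target, _fragment = urldefrag(href)  # drop any "#anchor" part
            if not target or ":" in target:
                continue  # skip same-page anchors and absolute URLs (http:, mailto:)
            if target not in pages:
                broken.append((path, target))
    return broken


# Hypothetical two-page site with one dangling link to news.html.
pages = {
    "index.html": '<a href="products.html">Products</a> <a href="news.html#today">News</a>',
    "products.html": '<a href="index.html">Home</a> <a href="http://www.example.com/">Partner</a>',
}
print(find_broken_links(pages))
```

The same pass that collects links could also record which page points to which, which is the raw material for the bird's-eye view of site structure mentioned above.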
An important distinction to be aware of in your content development is among static content, dynamic content, and software development. Static content is any content that looks the same every time a user sees it. HTML files, .gif images, and MPEG video all fall into this category. Dynamic content can range from HTML files that use server-side includes to show the current date and time to content that is completely software-generated. (Server-side includes are mechanisms that let you put calls to external programs inside your HTML files; the server parses the file and runs the programs when the file is requested.) Finally, software development would include software that end users see, such as applets, as well as software that executes on the server in the background, such as order-entry systems. Most self-described "site-management" tools handle only static content well.
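The following Python sketch shows roughly what a server does when it expands a server-side include for the current date and time. It assumes the Apache-style directive syntax `<!--#echo var="DATE_LOCAL" -->` and handles only that one directive; the function name and the date format are choices made for this example, and real servers support many more directives.

```python
import re
from datetime import datetime

# Matches directives of the form <!--#echo var="NAME" -->
DIRECTIVE = re.compile(r'<!--#echo\s+var="([A-Z_]+)"\s*-->')


def expand_includes(html, now=None):
    """Replace each echo directive with the value of the named variable."""
    now = now or datetime.now()
    variables = {
        "DATE_LOCAL": now.strftime("%Y-%m-%d %H:%M"),
    }

    def substitute(match):
        # Unknown variables expand to a visible placeholder in this sketch.
        return variables.get(match.group(1), "(none)")

    return DIRECTIVE.sub(substitute, html)


page = '<p>Generated at <!--#echo var="DATE_LOCAL" -->.</p>'
print(expand_includes(page, now=datetime(1997, 10, 1, 9, 30)))
```

Because the file must be parsed on every request, even this simple form of dynamic content costs more to serve than a static page.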
Static content scales well if you put style guidelines into place to encourage consistent navigation and look and feel on the site and if you can distribute the content development to the people who know the subject matter of the content. If you throw dynamic content and software into the mix, however, coordination and communication challenges will make scaling more difficult. All three types of content must not only coexist but interact with each other, and they often must work as a single system. If you have large numbers of software components on your Web site, you should manage the site as you would a software product, with HTML as the user interface. Java or ActiveX application interfaces can fall almost entirely into your usual software development practices, with the addition of some navigation to and from the application on your Web site.
To maintain a consistent look and feel between your software and the static content on your site, it is important to provide static content developers with tools to develop and maintain your software's interface. Again, there are no standards here, although most products use some form of specially formatted HTML comments to accomplish this. Future versions of standards may incorporate this practice.
In any discussion of software development on a Web site, the issue of which languages and tools to use will undoubtedly arise. From a maintenance perspective, the best approach would be to remain well within the bounds of server-independent approaches such as CGI and, to some extent, Fast-CGI. Web server interfaces are likely to change rapidly over the next few years, and trying to maintain a stable software product on a shifting foundation is next to impossible. If you must take advantage of the improved performance of proprietary server interfaces, you are better off building a single gateway routine that interfaces directly with the server and then making the rest of your code as server-neutral as possible. Most commercial Web server add-on products take this approach.
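One way to picture the single-gateway approach is sketched below in Python. The `Request` and `Response` types and the function names are hypothetical; the point is only that `handle` knows nothing about the server, while `cgi_gateway` is the one routine that would need rewriting if you moved from CGI to a proprietary server interface.

```python
import io
from dataclasses import dataclass


@dataclass
class Request:
    method: str
    path: str
    query: str


@dataclass
class Response:
    status: str
    body: str


def handle(request):
    """Server-neutral application code: no knowledge of CGI or any server API."""
    if request.path == "/hello":
        return Response("200 OK", "Hello from " + request.method)
    return Response("404 Not Found", "No such page")


def cgi_gateway(environ, out):
    """The single routine that knows the server interface (here, CGI)."""
    request = Request(
        method=environ.get("REQUEST_METHOD", "GET"),
        path=environ.get("PATH_INFO", "/"),
        query=environ.get("QUERY_STRING", ""),
    )
    response = handle(request)
    out.write("Status: " + response.status + "\r\n")
    out.write("Content-Type: text/plain\r\n\r\n")
    out.write(response.body)


# Simulated CGI environment; a real server would supply os.environ and sys.stdout.
buffer = io.StringIO()
cgi_gateway({"REQUEST_METHOD": "GET", "PATH_INFO": "/hello"}, buffer)
print(buffer.getvalue())
```

Supporting a second server interface would then mean writing a second small gateway that builds the same `Request` object, leaving `handle` untouched.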
By far the most mature of these add-on products are in the Web server-to-database server connectivity market. The basic idea behind these products is to reduce the overhead of repeatedly starting up CGI programs and of repeatedly reconnecting to your database for every HTTP session. You can think of these add-ons as extremely simplified TP monitors. Most of them handle only a single database connection per process (or thread), and because of the vagaries of HTTP, they can only guarantee transactional integrity from the database to the Web server, and not all the way to the Web browser client.
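The saving these products aim for can be sketched as a long-lived worker that opens its database connection once and reuses it for every request. This Python example uses the standard library's sqlite3 module with an in-memory database purely as a stand-in for a real database server; the class and table names are hypothetical.

```python
import sqlite3


class PersistentWorker:
    """One long-lived process (or thread) holding a single database connection."""

    def __init__(self, database):
        self.database = database
        self.connection = None
        self.connections_opened = 0

    def _connect(self):
        # Open the connection lazily, once, and reuse it for every later request.
        if self.connection is None:
            self.connection = sqlite3.connect(self.database)
            self.connections_opened += 1
        return self.connection

    def handle_request(self, sql, params=()):
        # Each request runs as one short transaction on the shared connection.
        conn = self._connect()
        with conn:  # commits on success, rolls back on an exception
            return conn.execute(sql, params).fetchall()


worker = PersistentWorker(":memory:")
worker.handle_request("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
worker.handle_request("INSERT INTO orders (item) VALUES (?)", ("widget",))
rows = worker.handle_request("SELECT id, item FROM orders")
print(rows, worker.connections_opened)
```

Three requests are served over one connection. As in the products described above, the transaction boundary ends at the worker: nothing here can guarantee that the browser actually received the result.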
Second, monitor your site constantly. Not only will this help you spot problems as soon as they arise, but it will also help you anticipate the need for more capacity. At this level, you should be monitoring the load on your servers, the network traffic on your site, and if possible, the network traffic to and from your clients. For public Internet sites, it can be difficult to monitor the responsiveness of the network all the way to users' Web browsers, but it's a requirement if you want to ensure quality service. Some Web site log-analysis tools (such as Accrue Software Inc.'s Insight) provide this information for HTTP sessions, and other vendors (such as net.Genesis Corp.) offer a service that polls your site from multiple locations and then reports on its responsiveness. Being aware of bottlenecks across the Internet as they occur can help you decide which ISP to use and where you should connect to the Internet to get the best performance no matter what happens on the network.
After your content is deployed, you, every content developer, and each of your managers will want to know how successful your site is. Thankfully, Web servers all create log files that contain most of the information you will want, and many tools, from free to tens of thousands of dollars, can help you analyze them. (See the accompanying article on page S12 for further discussion of log analysis.) Unfortunately, almost all of the log-analysis tools available will become unusable once your log files reach a certain size, and they may not be able to handle any customized data that you may want to capture from your applications. For example, a tool that parses log files will take longer to run the more log file data you have. Eventually it will take longer to parse the data than it takes your Web server to create it, and your analysis will fall farther and farther behind. Again, whatever tool you choose, keep in mind your scalability requirements and test the tool yourself with your own hardware and log files. Also, consider what data you need in real time, what data can wait a day or more, and when data can be removed from your log file archives.
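One common way to keep analysis from falling behind is to process only the new log lines each time and carry the running totals forward, rather than reparsing the whole archive. The Python sketch below assumes logs in Common Log Format; the function name and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)$')


def count_hits(lines, counts=None):
    """Add per-path hit counts from an iterable of log lines to counts."""
    counts = Counter() if counts is None else counts
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # silently skip lines that do not parse
            counts[match.group(4)] += 1
    return counts


# Invented sample data: yesterday's log, then today's new lines.
batch1 = [
    '192.0.2.7 - - [01/Oct/1997:09:30:00 -0400] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.9 - - [01/Oct/1997:09:31:12 -0400] "POST /order.cgi HTTP/1.0" 200 512',
]
batch2 = [
    '192.0.2.7 - - [02/Oct/1997:10:02:45 -0400] "GET /index.html HTTP/1.0" 304 -',
]

totals = count_hits(batch1)          # first run
totals = count_hits(batch2, totals)  # later run touches only the new lines
print(totals.most_common())
```

With this structure, the cost of each run grows with the amount of new traffic rather than with the total size of the archive, although the totals themselves must then be stored somewhere between runs.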