Internet Systems

Log Analysis

By Dan Rahmel
Internet Systems, July 1997

Using Web server log files to make an objective and useful analysis of your site.


Now that Web servers are providing a way for database information to be published easily, convenient user access to important resources is of vital interest to all database administrators (DBAs). To this end, tracking what data is accessed on a Web site can provide invaluable clues as to how to better present that data. Client/server database applications written in PowerBuilder, Visual Basic, or most other application development tools have never automatically maintained similar audit trails documenting what a user did inside of an application. In this way, Web server logs represent a new resource for administrators and developers.

Publishing information on the Web can be expensive in terms of both upkeep and squandered effort. If information that is vital to your organization cannot be easily accessed or found, the effort expended on collecting and publishing this data is wasted. Worse yet, the people who most need immediate access might be either unaware of its existence or unable to locate it in a reasonable period of time.

Simplifying the structure of the Web site and providing intuitive query tools are therefore of paramount interest to a DBA seeking to provide information both within an organization (over an intranet) and to those outside (over the Internet). Asking your users how they use the site and what information they seek is essential but may not provide an objective reflection of the site's use. You might receive positive remarks from users about the query front end, but they may have difficulty applying it to daily tasks. Without usage information, you may never learn that a large portion of the site's functionality goes unused.

A Web server log file can provide almost all of the information you need to analyze your site effectively. Web servers store information on all accesses to a Web site, most commonly referred to as "hits." The Web log file can be summarized to present a clear picture of the site usage. By analyzing the patterns of access, you can accurately judge not only the information that is accessed, but also the pages that are rarely or never examined.

If a page contains crucial information that is never even examined, the links or leaders to that page can be moved to a more dominant position. For DBAs, knowing the patterns of usage can enable the optimization of commonly needed data and queries as well as dedication of resources to the most common tasks.

Scores of tools are now available on the market to examine log files automatically and provide detailed summary and analysis reports. If these tools do not completely satisfy your information needs, you can also build a custom analysis tool or use ad hoc queries to supplement your site evaluation. Knowing exactly what is stored within the log file will help you understand how the summaries are compiled and what other information is stored on your server.

The Web Log Format

The first step toward turning your Web server log file into a resource is determining how it is stored and what information it contains. Log information may be stored either as a text file or as individual records in a database. Because storage to a file provides the most flexibility and portability, most Web servers default to using a text file. The format of the file can vary depending on the vendor, but most servers support the standard Common Log Format either natively or through a conversion utility.

The NCSA has created a standardized log format that provides a minimum amount of information detailing each client request for data. Known as the Common Log Format, it stores the IP address (or host name) of the client accessing the site and six other fields. These fields provide a baseline of information detailing the client's visit to the Web server.
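As a rough illustration of the format, the following Python sketch splits one Common Log Format line into its seven fields. The sample line, the address, and the user name are invented for the example, and the dictionary keys (host, ident, authuser, and so on) are simply labels chosen here; your server's documentation is the authority on what it actually writes.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of the seven CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 10/Jul/1997:13:55:36 -0700
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no byte count was recorded
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Invented sample line for illustration only
sample = ('192.0.2.10 - jsmith [10/Jul/1997:13:55:36 -0700] '
          '"GET /reports/index.html HTTP/1.0" 200 2326')
entry = parse_clf_line(sample)
print(entry["host"], entry["request"], entry["status"], entry["size"])
```

Lines that do not match the pattern return None rather than raising an error, so a damaged or vendor-extended entry can be counted and skipped instead of stopping the analysis.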

Each line in the log shows an individual request from the client. The host name, RFC name (the remote identity reported by identd), and log name (the authenticated user name) fields record the identity of the requester. Unlike traditional database access, users do not log in for a session and later log out. They may visit the site and in the next second be off to another site, with the Web server unable to know that they have left. Using the user fields in conjunction with the date/time stamp across multiple log records, you can approximate the analysis of a user session.
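One way to approximate sessions is to group requests by the user fields and start a new session whenever the same visitor has been idle for longer than some cutoff. The sketch below assumes the log has already been parsed into (host, user, timestamp, path) tuples; the 30-minute cutoff and the sample entries are assumptions made for the example, not values taken from any log standard.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed idle cutoff; tune for your site

def approximate_sessions(entries, gap=SESSION_GAP):
    """Group (host, user, timestamp, path) tuples into per-visitor sessions."""
    by_visitor = defaultdict(list)
    for host, user, timestamp, path in entries:
        by_visitor[(host, user)].append((timestamp, path))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()  # order each visitor's requests by time
        current = [hits[0]]
        for hit in hits[1:]:
            # A long silence means the earlier visit has probably ended
            if hit[0] - current[-1][0] > gap:
                sessions.append((visitor, current))
                current = [hit]
            else:
                current.append(hit)
        sessions.append((visitor, current))
    return sessions

# Invented entries for illustration only
entries = [
    ("192.0.2.10", "-", datetime(1997, 7, 10, 9, 0), "/index.html"),
    ("192.0.2.10", "-", datetime(1997, 7, 10, 9, 5), "/search.html"),
    ("192.0.2.10", "-", datetime(1997, 7, 10, 14, 0), "/index.html"),
    ("192.0.2.44", "jsmith", datetime(1997, 7, 10, 9, 1), "/reports.html"),
]
sessions = approximate_sessions(entries)
for visitor, hits in sessions:
    print(visitor, [path for _, path in hits])
```

Keep in mind that several people behind one proxy or firewall share an address, so sessions built this way remain an estimate.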

Although the Common Log Format provides a baseline of information, most Web server vendors extend this common format to include more precise information on access. Many Web servers add a few more fields in order to parse the information more easily. Microsoft Internet Information Server, for example, adds extra fields such as the ServerIP (in case the server uses multihoming) and processing time to determine the efficiency of the server and the connection.

Beyond server log file extensions, there are also tools (such as a program called SiteSpy by Reportech Inc. [www.reportech.com]) that extend the logged information to include variables available from the client browser. The three most common additions to the log file information are the referring site, the browser type, and the current cookie. The browser supplies these three pieces of information as headers on each HTTP request (Referer, User-Agent, and Cookie), so the server can record them as the request arrives.

The referring site parameter shows what link was used to reach a site. This parameter may also be used in conjunction with a user session to track a user's path through your site. Browser type lets you determine what browser programs are being used by your clients. Cookies can be set to store anything from user IDs to custom settings.

To make these types of additions to the log files for all HTTP requests, the server itself must be modified. This can be done by actual server recompilation or through a plug-in capability supplied with the server. For Microsoft Corp.'s IIS, these modifications can occur through the ISAPI programming interface.

Storing Log Information

Although the log may be stored in a file, the most flexible storage solution for analysis is to store the individual records in a database. Most Windows NT-based Web servers will allow the data to be stored in any ODBC data source. The information is written into various predefined fields.

The advantage of storing the log in this manner is that it lets you use indexing and querying to search the log data in an ad hoc manner. Rather than being limited to the features provided by a Web analysis package, you can use SQL queries to collate and compile any information you want. You can also use standard reporting tools (including HTML files constructed from the report results) to display summaries.
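To give a feel for this kind of ad hoc query, the sketch below loads a few log records into a table and asks for the most requested pages. It uses Python's built-in sqlite3 module as a stand-in for an ODBC data source, and the table name (weblog), the column names, and the rows are all hypothetical; the fields your server writes will have their own names.

```python
import sqlite3

# In-memory SQLite database standing in for the server's ODBC data source
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weblog (client_host TEXT, log_time TEXT, target TEXT, "
    "status INTEGER, bytes_sent INTEGER)"
)

# Invented rows for illustration only
rows = [
    ("192.0.2.10", "1997-07-10 09:00:00", "/index.html", 200, 2326),
    ("192.0.2.10", "1997-07-10 09:05:00", "/search.asp", 200, 5120),
    ("192.0.2.44", "1997-07-10 09:06:00", "/search.asp", 200, 4870),
    ("192.0.2.44", "1997-07-10 09:07:00", "/missing.html", 404, 0),
    ("192.0.2.51", "1997-07-10 09:08:00", "/search.asp", 200, 5011),
]
conn.executemany("INSERT INTO weblog VALUES (?, ?, ?, ?, ?)", rows)

# An index on the requested page keeps grouping queries fast as the log grows
conn.execute("CREATE INDEX idx_weblog_target ON weblog (target)")

# Most requested pages among successful requests, with distinct visitor counts
top_pages = conn.execute(
    "SELECT target, COUNT(*) AS hits, COUNT(DISTINCT client_host) AS visitors "
    "FROM weblog WHERE status = 200 "
    "GROUP BY target ORDER BY hits DESC, target"
).fetchall()

for target, hits, visitors in top_pages:
    print(target, hits, visitors)
```

The same SELECT statement could be pointed at whatever database your Web server logs to, once the table and column names are adjusted.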

Before you set your Web server to write the usage log entries into a database, be prepared for the additional performance penalty of such a system. Most Web servers can quickly append entries of Web requests to a file. However, if entries are logged to a database, the database connector may consume additional system resources and processor time.

Another consideration in your choice of file or database logging is how you intend to use the log entries. For most Web servers, how frequently the Web log is written to the disk varies depending on the implementation. A database log store will often insert entries immediately, but many servers will cache changes to a file log and only write the changes at specific intervals. If you rely on your Web log for realtime server monitoring, a database store may be your only option.

Information Summarized from the Web Log

Numerous resources on the Web explain how to use the information provided by a Web log for Internet marketing and promotion. However, a DBA's needs are usually slightly different. The majority of the analysis for administrators will be to determine the usage on an Intranet. Therefore, traditional market analysis tools may have gaping holes in terms of the summary information required by DBAs. By analyzing the Web log, you can determine the resources to devote to particular types of information. Because you're probably dealing with many internal users, scrutiny of the Web log itself (possibly on a sampling basis if the log file is very large) rather than exclusively using summary data might help you best understand the needs of your users.
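If the log is too large to read in full, one simple way to draw a fair sample in a single pass is reservoir sampling, sketched below. The function name, the sample size, and the stand-in log lines are made up for the example; in practice you would pass in an open log file instead of a list.

```python
import random

def sample_lines(lines, k, seed=None):
    """Return k lines chosen uniformly at random from an iterable of any length."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k lines
            reservoir.append(line)
        else:
            # Replace an existing line with probability k / (index + 1)
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir

# Stand-in for a very large log file: any iterable of lines works
fake_log = ["entry %d" % n for n in range(10000)]
picked = sample_lines(fake_log, k=25, seed=1997)
print(len(picked))
```

Because the function never holds more than k lines in memory, it can be handed a file object directly, for example sample_lines(open(path), 500), where path is wherever your server writes its log.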

Most Web tools break the data down to provide peak usage time, user locations, most used pages, and so on. Pay particular attention to the most commonly used pages. Are these database query pages? If so, they could be taking up a large percentage of the resources on your server. The log files for many servers also include the amount of time required to respond to a particular request. By paying attention to this number, you can determine if users are waiting an excessively long time for the return of a particular query.
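The summaries described above take only a few lines to compute yourself. This sketch assumes each request has already been reduced to a (timestamp, path, elapsed time in milliseconds) tuple; the elapsed time is a vendor extension rather than part of the Common Log Format, and the sample data is invented.

```python
from collections import Counter, defaultdict
from datetime import datetime

def summarize(hits):
    """hits: iterable of (timestamp, path, elapsed_ms) tuples."""
    per_hour = Counter()
    per_page = Counter()
    elapsed = defaultdict(list)
    for timestamp, path, elapsed_ms in hits:
        per_hour[timestamp.hour] += 1   # requests in each hour of the day
        per_page[path] += 1             # requests for each page
        elapsed[path].append(elapsed_ms)
    # Average response time per page, to spot slow queries
    average_ms = {path: sum(times) / len(times) for path, times in elapsed.items()}
    return per_hour, per_page, average_ms

# Invented requests for illustration only
hits = [
    (datetime(1997, 7, 10, 9, 0, 12), "/index.html", 40),
    (datetime(1997, 7, 10, 9, 14, 3), "/query.asp", 2300),
    (datetime(1997, 7, 10, 9, 40, 51), "/query.asp", 1900),
    (datetime(1997, 7, 10, 14, 2, 7), "/index.html", 60),
    (datetime(1997, 7, 10, 14, 5, 30), "/query.asp", 2100),
]
per_hour, per_page, average_ms = summarize(hits)
print("Peak hour:", per_hour.most_common(1))
print("Most used pages:", per_page.most_common(2))
print("Average response (ms):", average_ms)
```

A page that ranks high in both the request count and the average response time is a natural first candidate for query tuning.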

The log can also provide crucial information on the extent of navigation by common users. If many users never go past your home page or never visit your search page, there may well be something wrong with the presentation. A Web site's main purpose is to provide simple, accessible information. If navigation of your site is too difficult, you can make modifications to ease its use. The log file can then be used as a comparative tracker to determine whether the changes had a positive effect on how the site was used.

The Web log can also help you determine what browser was used to access your site. Although this data might not seem important, it can be crucial in determining how data on the site should be presented. For example, you might have a corporate Intranet that provides quality-control information, and company policy specifies that Netscape Communication Corp.'s Navigator 3.0 should be used by everyone within the company. Checking your Web log might reveal that half of the people accessing the corporate site are using Navigator 2.1, which lacks a key feature (support for Java or a Java Virtual Machine, for example) that you use in your pages for data presentation. From this information, you can either help users upgrade their browsers or change your content to be more easily viewed by these users.
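If your server records the browser type, a quick tally shows how closely actual usage follows company policy. The sketch below reduces each User-Agent string to a coarse label; the matching rules are a deliberate simplification, and the sample strings are only typical of what browsers of this period send. Note that Navigator identifies itself as Mozilla.

```python
import re
from collections import Counter

def browser_label(user_agent):
    """Reduce a User-Agent string to a coarse 'name version' label."""
    # Internet Explorer announces itself inside a Mozilla-compatible string
    msie = re.search(r"MSIE (\d+\.\d+)", user_agent)
    if msie:
        return "MSIE " + msie.group(1)
    # Otherwise use the leading product/version token, e.g. Mozilla/3.0
    product = re.match(r"([^/\s]+)/(\d+\.\d+)", user_agent)
    if product:
        return product.group(1) + " " + product.group(2)
    return "unknown"

# Illustrative User-Agent values, not taken from a real log
agents = [
    "Mozilla/3.0 (Win95; I)",
    "Mozilla/3.0 (WinNT; I)",
    "Mozilla/2.01 (Win95; I)",
    "Mozilla/2.0 (compatible; MSIE 3.02; Windows NT)",
]
counts = Counter(browser_label(agent) for agent in agents)
for label, hits in counts.most_common():
    print(label, hits)
```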

By examining data access statistics, you will also have a roadmap for the creation of a data warehouse. Your data warehouse should most accurately reflect the information deemed essential by managers and administrators. Looking for the most accessed data, like examining a well-worn path through a forest, will indicate where most managers need to go.

Basic Analysis is Just the Beginning

Log files are an excellent starting point for monitoring your site. However, DBAs face special problems. Most Web access to a database occurs through a database connector: a piece of middleware that builds HTML pages dynamically from the results of a SQL query. The connector might be a CGI application, a Perl script, an Active Server Page (ASP), or a custom program.

Therefore, log analysis tools may not provide the information you need because the most commonly accessed data will not appear directly as a page that was accessed, but instead as a query that contains various parameters. Your SQL query might be passed through Web parameters or be stored on a server-side HTML constructor template. In the Web log, this information will appear as a reference to a particular execution (such as a CGI call or an ASP query), but all instances will be lumped together as a single page access.

There are a few ways that DBAs can determine the most frequently queried information in order to optimize presentation of these results. The simplest method is to use the tools supplied with your particular database to determine the most common queries and data usage. This method is often very effective because the server-side database connector accesses a database through a single user login. If the database connector, for example, uses an ODBC data source, it usually requires a specific account such as Webuser that is used to do all Web queries. Tracking queries that are attributed to this particular user will reveal what queries have occurred.

Another method is to use commercial software that breaks down the log file based on additional query parameters. For example, a typical URL to perform a search might appear as "www.mycompany.com/dbconnect.cgi?argument1=1&argument2=cartype". If your Web analysis tool can extract and sort the parameters passed to your database connector, common queries can be determined.
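The extraction step itself is straightforward. The following sketch counts how often each parameter value is passed to the connector, using the dbconnect.cgi URL from the example above; the list of requests is invented, and the function name is only illustrative.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def tally_parameters(requested_urls, script_path="/dbconnect.cgi"):
    """Count each (name, value) pair passed to the database connector."""
    counts = Counter()
    for url in requested_urls:
        parts = urlsplit(url)
        if parts.path != script_path:
            continue  # ignore requests for ordinary pages
        for name, value in parse_qsl(parts.query):
            counts[(name, value)] += 1
    return counts

# Invented requests, as they might appear in the request field of the log
requests = [
    "/dbconnect.cgi?argument1=1&argument2=cartype",
    "/dbconnect.cgi?argument1=2&argument2=cartype",
    "/dbconnect.cgi?argument1=1&argument2=price",
    "/index.html",
]
counts = tally_parameters(requests)
for (name, value), hits in counts.most_common():
    print(name, value, hits)
```

Sorting by count puts the most common queries at the top, which is usually where optimization effort pays off first.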

The most potentially useful -- but also the most expensive -- method is to build a custom parsing application. Because of the diverse formats of log files and the variety of database connectors (which differ on the formats for passing parameters), a custom tool might be most effective for site analysis. If your database connector uses a type of custom page on the server to generate the final HTML files (as Allaire Corp.'s Cold Fusion and Microsoft Active Server Pages [ASPs] do), you can use the log file to cross-reference the template files to retrieve the actual queries and data used. This tool might either read a log file or search a database that stores log entries.

For the benefit of our readers, I created (and DBMS sponsored) a free example tool that examines a log file for references to ASPs and summarizes the query information. I constructed this tool using Microsoft Visual Basic 5. On the Client/Server Central Web site (www.coherentdata.com/cscentral), I posted both the finished application and the source code to help you build your own custom tool. I also posted an article on the tool's construction, so if you're using a tool besides VB, you may find the explanation instructive for creating a tool in your own development environment.

I highly recommend that you use a commercial or shareware analysis tool whenever possible. Creating a custom solution is expensive and time-consuming. C|net Inc.'s Shareware.com (www.shareware.com) has a number of free log analysis tools. Some of these tools may contain the features you need.


Dan Rahmel has 12 years of professional experience in design, programming, implementation, consulting, and writing about the computer industry. He currently works for Coherent Data, a service provider of secure commerce Web sites. He is also coauthor of the book Developing Client/Server Applications with Visual Basic (SAMS Publishing, 1996). You can email Dan at cvisual@electriciti.com or check out Client/Server Central (www.coherentdata.com/cscentral) for resource information.
* Aquas Inc., Sunnyvale, CA; 408-737-7122 or fax 408-737-1292; www.aquas.com. Bazaar Analyzer Pro
* BienLogic Inc., La Jolla, CA; 619-551-4888 or fax 619-551-4890; www.bienlogic.com. SurfReport
* e.g. software Inc., Portland, OR; 503-294-7025 or fax 503-294-7130; www.egsoftware.com; www.webtrends.com. WebTrends
* EveryWare Development Corp., Mississauga, Ontario, Canada; 888-819-2500, 905-819-1173, or fax 905-819-1172; www.everyware.com. Bolero
* Marketwave L.L.C., Seattle, WA; 800-521-8176, 206-682-6801, or fax 206-682-6805; www.marketwave.com. Hit List Pro
* Microsoft Corp., Redmond, WA; 800-426-9400, 206-882-8080, or fax 206-936-7329; www.microsoft.com or www.interse.com. Intersé Market Focus
* net.Genesis Corp., Cambridge, MA; 617-577-9800 or fax 617-577-9850; www.netgen.com. net.Analysis pro
* Open Market Inc., Cambridge, MA; 888-673-6658, 617-949-7000, or fax 617-621-1703; www.openmarket.com. WebReporter
* Reportech Inc., Bellevue, WA; 425-644-9646 or fax 425-644-9974; www.reportech.com. Site Spy
* WebManage Technologies Inc., White Plains, NY; 914-697-7555 or fax 914-697-7556; www.webmanage.com. NetIntellect

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, June 18, 1997.