DBMS, January 1996
DBMS Online: Desktop DBMS By Tom Spitzer

Needles in Document Haystacks

Text Retrieval and Management Technologies Come of Age.

I have a problem. Over the years, I have collected thousands of documents, which together take up hundreds of megabytes of disk space on my computers and on the LAN to which I'm usually connected. My collection is growing dramatically as I surf the Internet and retrieve information on an ever-expanding universe of topics, products, and technologies. I am able to find anything only because I trained myself to function as a human topic catalog early in life. When I was a student, I fancied myself a journalist and collected all manner of scraps of information. Before starting to collect information on disk, I already had sizable collections on paper, which were searchable only manually. Alas, as I enter middle age, I find that the breadth and depth of the information I and my colleagues are collecting are exceeding the capacity of my internal catalog system. We need tools.

There are several interesting facets to this problem. For one thing, it looks like a database problem, although the data fails to reveal itself in a particularly regular structure. Aside from its irregularity, a big problem with unstructured information is that it has very low information density. The facts I may need are dispersed and distributed throughout the material. Also, my documents are in a variety of formats, reflecting the document-creation tools that my company has used during the last decade. The distribution of documents across multiple file servers and Internet servers presents another challenge. Finally, I dream of a way to create a local index of interesting Internet sites that would enable me to return later when I need to see their contents. This would minimize the amount of document retrieval and local storage that I need to support my forays into cyberspace.

I realize that my problem is small compared to that of organizations, which retain much more information than I do. My collection of information represents only a fraction of what my company has acquired in its 15-year life span. By most accounts, we are a relatively small company. Around the world, while billions of records exist in structured, field-oriented databases, the vast majority of corporate information is still created and stored in documents. The Gartner Group estimates that more than 90 percent of corporate intellectual property exists as text in various document formats. This corporate information base is a critical resource for knowledge workers, but cannot be accessed by most DBMSs.

Full-Text Retrieval Systems

Where there are problems, there are usually candidate solutions, and I set out to find them. I discovered a highly competitive marketplace of full-text retrieval products. Full-text retrieval systems are designed to work well with large volumes of documents containing unstructured text and associated diagrams and images. An effective text-retrieval system provides the ability to store, manage, and subsequently retrieve textual information. The stored information can consist of complete documents, or it can be abstracts/bibliographies, email, notes, keywords, names, titles, and so on. Increasingly, these systems support multimedia information such as graphics, images, sound, and video as well. Effective systems allow users to access specific information without having preexisting knowledge of the content of the information base. More advanced text-retrieval systems allow non-expert users to undertake searches in a natural, instinctive way.

A full-text retrieval system should allow you to enter data in an unstructured (free) format, as is typical in books, memos, and other text documents. Users must also be able to search and retrieve information based on the words and concepts in the documents, in conjunction with information recorded about the documents in structured fields. For example, a magazine article would also have an author name and a publication date. To fulfill these requirements, the core of these systems consists of very sophisticated indexing engines. It's interesting and a little amusing to find that many products claim to incorporate the most sophisticated indexing system on the planet.

To determine how well a document fulfills a query, search engines measure the number of occurrences of search terms, the relative density of search terms in a document versus their density in all the documents in a search, the proximity of multiple search terms, and the location of search terms within a document. All of the products seem to have their own weighting algorithms. Some incorporate a user-modifiable thesaurus, which enables a search to find all documents that include the specified words and any words with the same meaning. Some products incorporate sophisticated stemming techniques for finding all occurrences of words with a common root. Xerox produces specialized lexical software for stemming, which other software vendors license. Probably the hottest technology in the field is concept or topic engines that allow users to specify concepts or topics, which the engines convert into word patterns. Most concept engines require user-defined concepts, which you can think of as saved queries. Newer technologies are being developed that allow concept searching without requiring you to first define concepts.
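To make the "relative density" idea concrete, here is a minimal Python sketch of one way such a score could be computed. It is not any vendor's weighting algorithm; the function names, the logarithmic rarity factor, and the sample data are all assumptions made for illustration, and it ignores proximity and location entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_documents(query, documents):
    """Rank documents by a simple density-based relevance score.

    For each query term, the score multiplies the term's density in the
    document (occurrences / document length) by a rarity factor that grows
    as fewer documents in the collection contain the term.
    """
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    total_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # term absent: contributes nothing to this document
            density = counts[term] / len(words)
            docs_with_term = sum(1 for w in tokenized.values() if term in w)
            rarity = math.log(1 + total_docs / docs_with_term)  # assumed weighting
            score += density * rarity
        scores[name] = score
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling score_documents("text index", {...}) returns (name, score) pairs, best match first. A word that appears in nearly every document adds little to the score, while a word concentrated in a few documents adds a lot.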

The products I discovered fell into several categories. I will describe two varieties this month, and leave some interesting emerging technologies for future consideration. The first category of products consists largely of programming APIs and indexing engines that developers can use to integrate document search and retrieval capabilities into corporate and commercial applications. The second category consists of full software products that provide the indexing and search engines as well as an environment in which the user works. Within each category, there are several variations. All of the vendors recognize the emergence of the Internet as a worldwide network of information resources and are developing products for Internet applications.

Verity

I researched this column almost entirely over the Internet. One of the first companies that I learned about was Verity Inc. Verity develops both tools and applications that enable intelligent search, filtering, and dissemination of textual information residing on enterprise networks, online services, the Internet, CD-ROM and other electronic media. Verity's Topic family of products includes Topic Enterprise Server, Topic Internet Server, Topic Client, Topic Agent Server, and Topic Developers Kit. Together they provide facilities for search, retrieval, and categorization of archived textual information, as well as real-time monitoring and filtering of incoming information such as news, alerts, or documents.

The Enterprise server indexes documents -- regardless of their location or format -- into a searchable Topic index, or "collection." The Topic Client is an attractive document navigation and browsing application whose key differentiating feature is agent technology. Topic Client provides "intelligent agents" that act on the user's behalf to search and retrieve specific types of information. Topic agents can sift through historic data (searcher agents), or you can deploy them as "watcher agents" looking for relevant information as it arrives. For example, Topic agents can watch data from a financial newswire (when coupled with the Topic News Server) to find information related to a user's stock portfolio. Investors can get timely results that help determine when to make portfolio decisions. The Topic Internet server adds the Topic engine's sophisticated full text indexing, search, and retrieval services to an existing Web site. These tools allow users to conduct personalized searches across Topic-indexed information stored within multiple sources and formats.

With the Topic engine, documents are not modified or altered for indexing. Documents are stored in their original locations and their native formats -- always ready for retrieval, deletion, or modification upon request. The Topic engine supports several data formats, including ASCII, Adobe Acrobat's PDF, SGML, HTML, and more than 50 word processing, spreadsheet, and desktop publishing formats. The Developers Kit offers a suite of optional database gateways that provide access to documents stored in Oracle, Sybase, or ODBC-compliant databases.

I spoke to Michael Williams, Verity's product manager for the Internet Server product. He demonstrated the company's indexing and search engine by indexing my online collection of articles, and creating a search form for me to use to search them. The search form was an HTML page on Verity's server at http://www.verity.com. (See Figure 1.) The form was set up so that when I clicked the Search button, it would run a CGI script to call a program that searched the index of the articles at http://www.database.org. From the results, it dynamically generated an HTML page listing the matching articles, with hyperlinks pointing to them, so that when I selected one from the list, it would retrieve and display the full text of the article.

Verity is only seven years old, and was launched in large part to respond to the information management demands of both governmental and private intelligence organizations. That makes a great deal of sense. Such organizations have been collecting documents for hundreds of years. Recently the company has had substantial success licensing the Topic technology to prominent providers of document management and Internet products as well as to online services, including Adobe Systems Inc., IBM/Lotus (which incorporates it into Notes), Netscape Communications Corp., Quarterdeck Corp., and MCI's Delphi Internet.

Fulcrum

In a similar vein, Fulcrum Technologies provides tools for building custom applications for managing text-oriented organizational information. Fulcrum SearchServer features a scalable, client/server distributed processing architecture. Developers access SearchServer through interfaces such as SQL and ODBC, which the company hopes will enhance the product's long-term compatibility and interoperability with other system components. Independent and corporate developers can build custom applications to deliver SearchServer's advanced text searching and retrieval capabilities to their users. Such applications use a server API to talk to SearchServer with Fulcrum SearchSQL (SQL with text-retrieval extensions), SearchSQL statement processing, and results processing. Like Microsoft's ODBC, Fulcrum's API adopts the SQL Access Group and X/Open Call Level Interface (CLI) standards. For Windows, the SearchServer API is packaged as an ODBC driver, supporting the ODBC standard method of accessing data across multiple, heterogeneous data sources.

SearchServer organizes and references textual information through table structures. Each row in a SearchServer table corresponds to a document (or text object). You can optionally define columns in the table schema to record additional information not stored in the document itself. Unlike traditional database systems, you do not have to store the document text physically in the SearchServer table; SearchServer's text reader architecture allows organizations to keep documents in their existing locations in their original formats, such as Microsoft Word or WordPerfect files. Text objects stored in a relational database may be searched and retrieved transparently through SearchServer's database text reader architecture.
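The table model described above can be sketched in a few lines of Python. This is an illustration only and does not use Fulcrum's API or SearchSQL: the DocumentRow class, its column names, and the pluggable reader function are hypothetical stand-ins for a table row, its optional metadata columns, and a text reader. For brevity the sketch scans document text on every search, which a real indexing engine would avoid.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocumentRow:
    """One table row: a pointer to a document plus extra metadata columns."""
    path: str        # where the document lives; the text itself is not stored here
    author: str      # hypothetical metadata column
    published: date  # hypothetical metadata column


def read_plain_text(path):
    """Default 'text reader': fetch a plain-text document from its own location."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def search_table(rows, word, author=None, reader=read_plain_text):
    """Return rows whose metadata and document text both match."""
    matches = []
    for row in rows:
        # Structured condition on a column, if one was given.
        if author is not None and row.author != author:
            continue
        # Text condition, evaluated against the document wherever it is stored.
        if word.lower() in reader(row.path).lower():
            matches.append(row)
    return matches
```

Swapping in a different reader function is the point of the design: the rows stay the same whether the text comes from a word processor file, a network share, or a database column.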

SearchServer builds a comprehensive index of terms to provide high-performance searching. During the search process, SearchServer accesses only the index files; the actual documents are accessed only for display and indexing purposes. SearchServer supports incremental updating of its indexes. It gives you the option of updating the index in batch mode or whenever the table is modified.
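The general technique behind index-only searching with incremental updates is the inverted index. The Python sketch below shows the idea under simple assumptions; the InvertedIndex class and its method names are invented for illustration and say nothing about how SearchServer stores its index files.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each word to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> {doc_id, ...}
        self.doc_words = {}               # doc_id -> words indexed for it

    def add(self, doc_id, text):
        """Index a new document, or re-index one whose text has changed."""
        self.remove(doc_id)
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for word in words:
            self.postings[word].add(doc_id)
        self.doc_words[doc_id] = words

    def remove(self, doc_id):
        """Drop one document's entries without rebuilding the whole index."""
        for word in self.doc_words.pop(doc_id, set()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def search(self, *terms):
        """Return ids of documents containing every term (index lookup only)."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()
```

Note that search never opens a document; it only intersects sets already held in the index. Calling add once per changed document corresponds to updating on modification, while looping over a queue of changed documents would correspond to batch mode.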

The Fulcrum SearchBuilder products are toolkits for Visual Basic and PowerBuilder. These kits include custom visual controls and APIs for connecting to SearchServer and application templates. These toolkits enable developers to rapidly prototype, develop and deploy SearchServer-based custom applications using the same development tools they use for their more structured database applications. You can also connect to SearchServer via any ODBC-enabled application to build and maintain text-centric applications using off-the-shelf third-party tools.

PC-Grown Products

While the Verity and Fulcrum products are coming to the PC platform from Unix client/server backgrounds, several products come to text management from a PC background. Included in this group are some fairly venerable products, such as askSam and ZyIndex, both of which have had recent upgrades to enhance their usefulness in a Windows-based, Internet-oriented world. Also in this category is a product called Concordance, from Dataflight Software.

A "concordance" is an alphabetical index of all the words in a text or group of texts, showing every contextual occurrence of a word. With the Concordance software product, you create databases of documentary information and load them with documents, fragments of documents, or keywords with pointers to documents. Concordance tables are much like the tables you would create in a typical database product, with the added wrinkle that they provide a paragraph data type into which the database builder loads the textual data. After the database is loaded with documents, the index function builds a concordance of all the words stored in paragraph fields. When searching a Concordance database, you can create queries that combine field elements (for example, documents created after January 1, 1995) with text. Concordance has 20 search operators that are grouped into four functional areas: context, proximity, Boolean, and relational. The proximity operators adj and near allow the searcher to find words that occur up to 99 words apart. The Boolean operators include xor, an exclusive-or operator that locates documents that include one but not both of the search terms.
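A concordance in the dictionary sense, along with proximity and exclusive-or tests, can be sketched in Python as follows. The function names and the context width are assumptions, and the near function here is a generic "within N words, in either order" test; it is not claimed to reproduce the exact semantics of the Concordance product's adj and near operators.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_concordance(documents, width=2):
    """Return {word: [(doc_id, context), ...]} with words in alphabetical order.

    The context is the word plus up to `width` neighbors on each side.
    """
    entries = defaultdict(list)
    for doc_id, text in documents.items():
        words = tokenize(text)
        for i, word in enumerate(words):
            context = " ".join(words[max(0, i - width): i + width + 1])
            entries[word].append((doc_id, context))
    return dict(sorted(entries.items()))


def near(text, first, second, distance):
    """True if the two words occur within `distance` words of each other."""
    words = tokenize(text)
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(a - b) <= distance for a in pos_first for b in pos_second)


def xor(text, first, second):
    """True if the text contains exactly one of the two words."""
    words = set(tokenize(text))
    return (first in words) != (second in words)
```

A production system would answer near and xor from stored word positions in the index rather than by re-reading the text, but the logic of each operator is the same.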

ZyIndex, introduced in 1983, was the first PC-based full-text retrieval system. ZyIndex locates documents based on content, and provides an extensive toolkit for search customization. Its search options include Word, Phrase, Boolean operators, Proximity, Quorum, Numeric or Date Range, Separators, and Wildcards. ZyIndex's progressive search feature lets you start a new search based on the results of the previous one. (See Figure 2.) Like Concordance, ZyIndex supports definition of data fields and compound searches based on a combination of fields and full-text information.
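Two of these search styles are easy to illustrate in Python. The sketch assumes the common meaning of a quorum search (match at least N of the listed terms) and treats a progressive search as a new search restricted to earlier hits; both function names are hypothetical, and neither is taken from ZyIndex's own documentation.

```python
import re


def words_of(text):
    """Return the set of lowercase alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def quorum_search(documents, terms, minimum):
    """Return ids of documents containing at least `minimum` of `terms`."""
    wanted = {t.lower() for t in terms}
    return {doc_id for doc_id, text in documents.items()
            if len(wanted & words_of(text)) >= minimum}


def progressive_search(documents, previous_hits, term):
    """Search for `term` only within the hits of an earlier search."""
    return {doc_id for doc_id in previous_hits
            if term.lower() in words_of(documents[doc_id])}
```

Chaining the two narrows a result set step by step: run a loose quorum search first, then pass its hits to progressive_search with a more specific term.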

ZyIndex turns every word in your documents into keys that enable content-based retrieval. ZyIndex gives you search results within seconds, which can be ranked for relevancy based on user specifications. It leaves document files in place and searches through them in their native formats, with a wide range of supported formats, including WordPerfect, Microsoft Word, Lotus 1-2-3, Excel, and even PKZIP. After using ZyIndex to find a file, the user can open the file with the application program used to create it. In addition to making it convenient to make immediate modifications to source documents without ending a search session, ZyIndex provides an embedded hyperlinking feature for adding remarks, verbal notes, and video files to a source document without editing the actual text of that document. ZyIndex will replay any such annotations during subsequent viewing sessions.

ZyLab, the publisher of ZyIndex, is trying to catch the Internet wave with its ZyIndex for Internet product. ZyIndex for Internet is a Web server that runs on Microsoft Windows NT. When connected to the Internet, it lets users access full-text indexes created with ZyImage or ZyIndex, using standard Web browsers. ZyIndex for Internet automatically generates HTML based on the user's query and related document content. Once the home page and a few simple templates are set up, Internet providers are spared from writing HTML. ZyIndex for Internet significantly speeds up Web searching because transactions are performed on a single index instead of on a series of sequential files. Selected information is converted to HTML in real time and passed to the client Web browser in the form of a list of relevant documents with pointers to the source files. This reduces bandwidth requirements for the client as well as download time.
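Generating a results page on the fly is simple in principle. The Python sketch below turns a list of hits into an HTML page of hyperlinks; the render_results function, the page layout, and the sample URL are assumptions for illustration, not ZyLab's templates.

```python
from html import escape


def render_results(query, hits):
    """Build an HTML page listing (title, url) hits as hyperlinks."""
    # Escape titles and URLs so characters such as < and & cannot break the page.
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for title, url in hits
    )
    return (
        "<html><head><title>Search results</title></head><body>\n"
        f"<h1>Results for {escape(query)}</h1>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body></html>"
    )
```

Because the page carries only titles and pointers, the browser downloads a full document only when the user follows one of the links.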

askSam

Like ZyIndex, askSam has been around for a few years and has just released version 3.0 of its flagship product, a Windows version with expanded support for storing information collected from online services. An askSam user would create subject databases that combine both fixed-length fields and unstructured text components, and populate them by importing the documents created and collected from a variety of sources. The user could then search the database for combinations of words, words in proximity to one another, dates, or numbers. askSam also provides templates for importing email and forum messages from CompuServe filing cabinets into textual databases, an import filter for Lexis/Nexis stories, and the ability to use the product as both an HTML reader and authoring tool.

askSam has the most extensive feature set of the desktop products, offering many character and page formatting options, authoring tools, entry form and report design tools, mail merge, database statistics, and graphics file support. These features are all packaged in an attractive and fairly intuitive Windows product. I have used it for a while to manage my electronic correspondence database. Building a list of the 50 or so messages (out of more than 3000) that contain the string "DBMS" in either their subject or their body is virtually instantaneous. The list provides hyperlinks back to the source message so that I can immediately see the contents of any one of the matching messages. My only frustration with askSam now is that it does not interact more fully with a Web browser. When I click on a URL in askSam it displays the URL details in a dialog, which I can copy and paste into my browser (I'm using both Netscape and Microsoft Internet Explorer). It ought to save me the step by handing the URL off to the browser, or supplying an embedded browser of its own. On the receiving end, I have to save the site I'm browsing and then switch to askSam to import it. I'd really like to see this work as a one-step process. For instance, askSam could provide a utility that hooks onto the browser and appends the Web site I'm browsing into an askSam database.

Future Tools

The increasing interest in text management has led to the entry of several high-tech software startups; I'll mention two briefly. The first is Architext Software. This is the brainchild of six recent Stanford computer science graduates who became frustrated with the current crop of text information management tools. These guys believed that: 1) most people searching for information don't know exactly what they are looking for, and 2) those who do don't know what keywords to use to find relevant documents. Their mission is to develop better retrieval tools and a suite of interactive browsing and hyperlinking tools that will help people navigate unfamiliar collections of data and quickly focus on interesting sources of information. Architext has attracted high-profile media attention as well as the interest of some of Silicon Valley's leading venture capitalists. I am currently using Architext's Internet search engine, called Excite NetSearch (http://www.excite.com). Architext's concept search facility gives me the most relevant set of references of all the search engines I have tried. (See Figure 3.)

Another recent entry is LivePage from a company called Inforium, the Information Atrium Inc. Some well-known computer scientists at the University of Waterloo, Ontario, including a cofounder of Watcom and a former CEO of Waterloo Maple Software, came together to build a system of open, non-proprietary text and information management software products. They built their system on SGML and SQL standards; it stores SGML documents in popular relational DBMSs such as Oracle, Watcom, Sybase SQL Server, or Microsoft SQL Server. You can also use LivePage Tools to build custom database solutions with Visual Basic, PowerBuilder, and C. In the future, a standards-based approach that takes advantage of existing DBMS products and tools will make a lot of sense to organizations looking to implement text-retrieval applications.

An Evaluation Guide

I haven't mentioned all of the products that provide text-retrieval functions; however, my survey managed to uncover the key features and capabilities of these products and the directions in which their vendors are heading. As my opening comments implied, the ability to create indexes across diverse file formats and locations is important. Toolkits that let me build custom applications that integrate text-search functions into more traditional data management environments could be very useful. As the use of Web browsers for accessing information grows, the ability to search across Internet sites and format results into HTML documents will also be required. Finally, topic or concept search features may become far more prominent than the traditional formulation of search criteria as word combinations and operators. The Architext engine is just one example. I don't understand exactly how Architext maps my concept to an underlying word list, but I can see that this powerful search technology will be the foundation for the next generation of search engines.


Tom Spitzer is managing consultant for application solutions in the San Francisco office of AmeriData Consulting. You can email Tom at tspitzer@ameridata.com.
* Architext Software, 2700 Garcia, Ste. 300, Mountain View, CA 94043; 415-934-3611 or fax 415-934-3610; Internet mail: info@atext.com; Excite NetSearch Web site: http://www.excite.com.
* askSam Systems, P.O. Box 1428, Perry, FL 32347; 800-800-1997 or fax 904-584-7481; Internet mail: emma@asksam.com; Web site: http://www.asksam.com.
* Dataflight Software Inc., 2337 Roscomare Rd., Ste. 11, Los Angeles, CA 90077; 310-471-3414 or fax 310-471-5294; Internet mail: info@dataflight.com; Web site: http://www.dataflight.com.
* Fulcrum Technologies Inc., 785 Carling Ave., Ottawa, Ontario, Canada K1S 5H4; 613-238-1761 or fax 613-238-7695; Web site: http://www.fulcrum.com.
* The Information Atrium Inc., 158 University Ave. West, Waterloo, Ontario, Canada N2L 3E9; 519-885-2181 or fax 519-746-7362; Internet mail: info@inforium.com; Web site: http://www.inforium.com.
* Verity Inc., 1550 Plymouth St., Mountain View, CA 94043; 415-960-7600 or fax 415-960-7698; Internet mail: info@verity.com; Web site: http://www.verity.com.
* ZyLab Corp., 19650 Club House Rd., Ste. 106, Gaithersburg, MD 20879; 800-544-6339, 301-590-2760, or fax 301-590-0903; Web site: http://www.zylab.com.

Figure 1. An example of a Verity Search screen. This is a Web page that Verity built for me to search my DBMS articles.

Figure 2. An example of a ZyIndex Search screen. ZyIndex lets me search through the many documents I've collected about accounting topics.

Figure 3. An example of an Excite Results screen. Architext's Excite search engine let me enter a concept search for "Spitzer DBMS articles," and gave me better results than Lycos or Webcrawler.




DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Sunday, December 1, 1996