I have a problem. Over the years, I have collected thousands of documents, which together take up hundreds of megabytes of disk space on my computers and on the LAN to which I'm usually connected. My collection is growing dramatically as I surf the Internet and retrieve information on an ever-expanding universe of topics, products, and technologies. I am able to find anything only because I trained myself to function as a human topic catalog early in life. When I was a student, I fancied myself a journalist and collected all manner of scraps of information. Before starting to collect information on disk, I already had sizable collections on paper, which were searchable only by hand. Alas, as I enter middle age, I find that the breadth and depth of the information my colleagues and I are collecting is exceeding the capacity of my internal catalog system. We need tools.
There are several interesting facets to this problem. For one thing, it looks like a database problem, although the data fails to reveal itself in a particularly regular structure. Aside from its irregularity, a big problem with unstructured information is that it has very low information density. The facts I may need are dispersed and distributed throughout the material. Also, my documents are in a variety of formats, reflecting the document-creation tools that my company has used during the last decade. The distribution of documents across multiple file servers and Internet servers presents another challenge. Finally, I dream of a way to create a local index of interesting Internet sites that would enable me to return later when I need to see their contents. This would minimize the amount of document retrieval and local storage that I need to support my forays into cyberspace.
I realize that my problem is small compared to that of organizations, which retain far more information than I do. My collection of information represents only a fraction of what my company has acquired in its 15-year life span, and by most accounts we are a relatively small company. Around the world, while billions of records exist in structured, field-oriented databases, the vast majority of corporate information is still created and stored in documents. The Gartner Group estimates that more than 90 percent of corporate intellectual property exists as text in various document formats. This corporate information base is a critical resource for knowledge workers, but most DBMSs cannot access it.
A full-text retrieval system should allow you to enter data in an unstructured (free) format, as is typical in books, memos, and other text documents. Users must also be able to search and retrieve information based on the words and concepts in the documents, in conjunction with information recorded about the documents in structured fields. For example, a magazine article would also have an author name and a publication date. To fulfill these requirements, the core of these systems consists of very sophisticated indexing engines. It's interesting and a little amusing to find that many products claim to incorporate the most sophisticated indexing system on the planet.
To determine how well a document fulfills a query, search engines measure the number of occurrences of search terms, the relative density of search terms in a document versus their density in all the documents in a search, the proximity of multiple search terms, and the location of search terms within a document. All of the products seem to have their own weighting algorithms. Some incorporate a user-modifiable thesaurus, which enables a search to find all documents that include the specified words and any words with the same meaning. Some products incorporate sophisticated stemming techniques for finding all occurrences of words with a common root. Xerox produces specialized lexical software for stemming, which other software vendors license. Probably the hottest technology in the field is concept or topic engines that allow users to specify concepts or topics, which the engines convert into word patterns. Most concept engines require user-defined concepts, which you can think of as saved queries. Newer technologies are being developed that allow concept searching without requiring you to first define concepts.
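To make the ranking factors above concrete, here is a minimal sketch in Python that combines occurrence counts, relative density, proximity, and location into one score. It does not reproduce any vendor's algorithm. The function names, the sample documents, and every weight (the 0.1 location bonus, the first-ten-words cutoff, the 1/gap proximity bonus) are my own assumptions, chosen only for illustration.

```python
import math
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query_terms, doc_tokens, all_docs_tokens):
    """Combine occurrence count, relative density, proximity, and location.

    All weights are arbitrary illustrations, not any product's formula.
    """
    n_docs = len(all_docs_tokens)
    total = 0.0
    positions = {}
    for term in query_terms:
        hits = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not hits:
            continue
        positions[term] = hits
        # Density in this document, weighted up when the term is rare overall.
        tf = len(hits) / len(doc_tokens)
        df = sum(1 for toks in all_docs_tokens if term in toks)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        total += tf * idf
        # Location: small bonus when the term appears in the first ten words.
        if hits[0] < 10:
            total += 0.1
    # Proximity: bonus that grows as two different query terms sit closer.
    found = list(positions.values())
    if len(found) >= 2:
        gap = min(abs(a - b) for a in found[0] for b in found[1])
        total += 1.0 / gap
    return total

# Hypothetical sample collection.
docs = {
    "a": "full text retrieval engines index text",
    "b": "relational databases store structured records in tables and sometimes text",
    "c": "the weather is fine",
}
tokens = {name: tokenize(text) for name, text in docs.items()}
query = ["text", "retrieval"]
scores = {name: score(query, toks, list(tokens.values()))
          for name, toks in tokens.items()}
ranked = sorted(docs, key=lambda name: scores[name], reverse=True)
```

A thesaurus or stemmer would slot in before scoring, by expanding `query` or by normalizing the tokens. Real products presumably tune their weights far more carefully than this.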
The products I discovered fell into several categories. I will describe two varieties this month, and leave some interesting emerging technologies for future consideration. The first category of products consists largely of programming APIs and indexing engines that developers can use to integrate document search and retrieval capabilities into corporate and commercial applications. The second category consists of full software products that provide the indexing and search engines as well as an environment in which the user works. Within each category, there are several variations. All of the vendors recognize the emergence of the Internet as a worldwide network of information resources and are developing products for Internet applications.
The Enterprise server indexes documents -- regardless of their location or format -- into a searchable Topic index, or "collection." The Topic Client is an attractive document navigation and browsing application whose key differentiating feature is agent technology. Topic Client provides "intelligent agents" that act on the user's behalf to search and retrieve specific types of information. Topic agents can sift through historic data (searcher agents), or you can deploy them as "watcher agents" looking for relevant information as it arrives. For example, Topic agents can watch data from a financial newswire (when coupled with the Topic News Server) to find information related to a user's stock portfolio. Investors can get timely results that help determine when to make portfolio decisions. The Topic Internet server adds the Topic engine's sophisticated full text indexing, search, and retrieval services to an existing Web site. These tools allow users to conduct personalized searches across Topic-indexed information stored within multiple sources and formats.
With the Topic engine, documents are not modified or altered for indexing. Documents are stored in their original locations and their native formats -- always ready for retrieval, deletion, or modification upon request. The Topic engine supports several data formats, including ASCII, Adobe Acrobat's PDF, SGML, HTML, and more than 50 word processing, spreadsheet, and desktop publishing formats. The Developers Kit offers a suite of optional database gateways that provide access to documents stored in Oracle, Sybase, or ODBC-compliant databases.
I spoke to Michael Williams, Verity's product manager for the Internet Server product. He demonstrated the company's indexing and search engine by indexing my online collection of articles, and creating a search form for me to use to search them. The search form was an HTML page on Verity's server at http://www.verity.com. (See Figure 1.) The form was set up so that when I clicked the Search button, it would run a CGI script to call a program that searched the index of the articles at http://www.database.org. From the results, it dynamically generated an HTML page listing the matching articles, with hyperlinks pointing to them, so that when I selected one from the list, it would retrieve and display the full text of the article.
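The last step of that demonstration, turning a list of matches into a page of hyperlinks, can be sketched in a few lines of Python. This is only an illustration of the idea and not Verity's script. The function name `results_page` and the example.com URLs are hypothetical.

```python
from html import escape

def results_page(query, matches):
    """Build an HTML page listing (title, url) matches as hyperlinks."""
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for title, url in matches
    )
    return (
        "<html><head><title>Search results</title></head><body>\n"
        f"<h1>Results for {escape(query)}</h1>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body></html>"
    )

# Hypothetical matches; a real script would get these from the search engine.
matches = [
    ("Text & Retrieval", "http://example.com/a1.html"),
    ("Indexing Basics", "http://example.com/a2.html"),
]
page = results_page("full text", matches)
```

Escaping the titles and URLs matters because they come from indexed documents and user input, either of which may contain characters that are special in HTML.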
Verity is only seven years old, and was launched in large part to respond to the information management demands of both governmental and private intelligence organizations. That makes a great deal of sense. Such organizations have been collecting documents for hundreds of years. Recently the company has had substantial success licensing the Topic technology to prominent providers of document management and Internet products as well as to online services, including Adobe Systems Inc., IBM/Lotus (which incorporates it into Notes), Netscape Communications Corp., Quarterdeck Corp., and MCI's Delphi Internet.
SearchServer organizes and references textual information through table structures. Each row in a SearchServer table corresponds to a document (or text object). You can optionally define columns in the table schema to record additional information not stored in the document itself. Unlike traditional database systems, you do not have to store the document text physically in the SearchServer table; SearchServer's text reader architecture allows organizations to keep documents in their existing locations in their original formats, such as Microsoft Word or WordPerfect files. Text objects stored in a relational database may be searched and retrieved transparently through SearchServer's database text reader architecture.
SearchServer builds a comprehensive index of terms to provide high-performance searching. During the search process, SearchServer accesses only the index files; the actual documents are accessed only for display and indexing purposes. SearchServer supports incremental updating of its indexes. It gives you the option of updating the index in batch mode or whenever the table is modified.
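The general technique behind this kind of design is an inverted index: a map from each term to the documents that contain it. The Python sketch below is my own illustration of that idea and of incremental updating. It says nothing about SearchServer's internal structures or its API. The class name, method names, and file names are all hypothetical.

```python
import re
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: term -> ids of documents containing the term."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_terms = {}               # document id -> terms indexed for it

    def update(self, doc_id, text):
        """Index a new document or re-index a changed one (incremental)."""
        self.remove(doc_id)
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        for term in terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = terms

    def remove(self, doc_id):
        """Drop a document's entries without rebuilding the whole index."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, *terms):
        """Return ids of documents containing all terms, using only the index."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

index = InvertedIndex()
index.update("memo.doc", "Quarterly sales report for the database group")
index.update("notes.txt", "Meeting notes about the sales forecast")
first = index.search("sales")

# The document changes, so only its own entries are re-indexed.
index.update("notes.txt", "Meeting notes about the hiring plan")
second = index.search("sales")
```

Calling `update` whenever a document changes corresponds to immediate updating. Collecting the changed documents and looping over them later corresponds to batch mode. In both cases the search itself never opens the documents.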
The Fulcrum SearchBuilder products are toolkits for Visual Basic and PowerBuilder. These kits include custom visual controls and APIs for connecting to SearchServer and application templates. These toolkits enable developers to rapidly prototype, develop and deploy SearchServer-based custom applications using the same development tools they use for their more structured database applications. You can also connect to SearchServer via any ODBC-enabled application to build and maintain text-centric applications using off-the-shelf third-party tools.
A "concordance" is an alphabetical index of all the words in a text or group of texts, showing every contextual occurrence of each word. With the Concordance software product, you create databases of documentary information and load them with documents, fragments of documents, or keywords with pointers to documents. Concordance tables are much like the tables you would create in a typical database product, with the added wrinkle that they provide a paragraph data type into which the database builder loads the textual data. After the database is loaded with documents, the index function builds a concordance of all the words stored in paragraph fields. When searching a Concordance database, you can create queries that combine field criteria (for example, documents created after January 1, 1995) with text searches. Concordance has 20 search operators that are grouped into four functional areas: context, proximity, Boolean, and relational. The proximity operators adj and near allow the searcher to find words that occur up to 99 words apart. The Boolean operators include xor, an exclusive-or operator that locates documents that include one but not both of the search terms.
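A small Python sketch shows what a concordance looks like as a data structure and how proximity and exclusive-or searches can be answered from it. The functions `build_concordance`, `near`, and `xor` are hypothetical names of my own, not the product's query syntax. In particular, I am assuming that a proximity match means "within N words in either order", which may not be how the product distinguishes adj from near.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_concordance(docs, width=2):
    """Map each word, alphabetically, to (doc id, position, context) entries."""
    entries = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for i, word in enumerate(tokens):
            context = " ".join(tokens[max(0, i - width): i + width + 1])
            entries[word].append((doc_id, i, context))
    return dict(sorted(entries.items()))

def near(concordance, a, b, distance):
    """Documents in which a and b occur within `distance` words of each other."""
    hits = set()
    for doc_a, pos_a, _ in concordance.get(a, []):
        for doc_b, pos_b, _ in concordance.get(b, []):
            if doc_a == doc_b and abs(pos_a - pos_b) <= distance:
                hits.add(doc_a)
    return hits

def xor(concordance, a, b):
    """Documents that contain one of the two words but not both."""
    docs_a = {doc_id for doc_id, _, _ in concordance.get(a, [])}
    docs_b = {doc_id for doc_id, _, _ in concordance.get(b, [])}
    return docs_a ^ docs_b

# Hypothetical sample documents.
docs = {
    "d1": "The contract was signed by the vendor in January",
    "d2": "The vendor delivered the software late",
    "d3": "The contract covers software and the vendor must support it",
}
concordance = build_concordance(docs)
close = near(concordance, "contract", "vendor", 5)
either = xor(concordance, "contract", "software")
```

Here `close` holds d1 and d3, where the two words sit five words apart, and `either` holds d1 and d2, because d3 contains both words.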
ZyIndex, introduced in 1983, was the first PC-based full-text retrieval system. ZyIndex locates documents based on content, and provides an extensive toolkit for search customization. Its search options include word, phrase, Boolean, proximity, quorum, and numeric or date range searches, as well as separators and wildcards. ZyIndex's progressive search feature lets you start a new search based on the results of the previous one. (See Figure 2.) Like Concordance, ZyIndex supports definition of data fields and compound searches based on a combination of fields and full-text information.
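Two of these ideas are easy to sketch in Python. I am assuming that a quorum search means "match documents containing at least k of the n listed terms", and I model a progressive search as a search restricted to an earlier result set. The function `quorum_search`, its parameters, and the sample files are hypothetical and are not ZyIndex syntax.

```python
import re

def words(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def quorum_search(docs, terms, minimum, within=None):
    """Return ids of documents containing at least `minimum` of `terms`.

    `within` optionally restricts the search to an earlier result set,
    which is one way to model a progressive search.
    """
    candidates = docs if within is None else {d: docs[d] for d in within}
    wanted = {t.lower() for t in terms}
    return {doc_id for doc_id, text in candidates.items()
            if len(wanted & words(text)) >= minimum}

# Hypothetical sample documents.
docs = {
    "a.txt": "Invoice for database software license",
    "b.txt": "Database training schedule",
    "c.txt": "Lunch menu",
}
broad = quorum_search(docs, ["database", "software", "license"], 1)
strict = quorum_search(docs, ["database", "software", "license"], 2)
refined = quorum_search(docs, ["training"], 1, within=broad)
```

Raising `minimum` from 1 to 2 narrows the result from two documents to one, and `refined` looks only at the documents the broad search already found.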
ZyIndex turns every word in your documents into keys that enable content-based retrieval. ZyIndex gives you search results within seconds, which can be ranked for relevancy based on user specifications. It leaves document files in place and searches through them in their native formats, with a wide range of supported formats, including WordPerfect, Microsoft Word, Lotus 1-2-3, Excel, and even PKZIP. After using ZyIndex to find a file, the user can open the file with the application program used to create it. In addition to making it convenient to make immediate modifications to source documents without ending a search session, ZyIndex provides an embedded hyperlinking feature for adding remarks, verbal notes, and video files to a source document without editing the actual text of that document. ZyIndex will replay any such annotations during subsequent viewing sessions.
ZyLab, the publisher of ZyIndex, is trying to catch the Internet wave with its ZyIndex for Internet product. ZyIndex for Internet is a Web server that runs on Microsoft Windows NT. When connected to the Internet, it lets users access full-text indexes created with ZyImage or ZyIndex, using standard Web browsers. ZyIndex for Internet automatically generates HTML based on the user's query and related document content. Once the home page and a few simple templates are set up, Internet providers are spared from writing HTML. ZyIndex for Internet significantly speeds up Web searching because transactions are performed on a single index instead of on a series of sequential files. Selected information is converted to HTML in real time and passed to the client Web browser in the form of a list of relevant documents with pointers to the source files. This reduces bandwidth requirements for the client as well as download time.
askSam has the most extensive feature set of the desktop products, offering many character and page formatting options, authoring tools, entry form and report design tools, mail merge, database statistics, and graphics file support. These features are all packaged in an attractive and fairly intuitive Windows product. I have used it for a while to manage my electronic correspondence database. Building a list of the 50 or so messages (out of more than 3000) that contain the string "DBMS" in either their subject or their body is virtually instantaneous. The list provides hyperlinks back to the source message so that I can immediately see the contents of any one of the matching messages. My only frustration with askSam now is that it does not interact more fully with a Web browser. When I click on a URL in askSam it displays the URL details in a dialog, which I can copy and paste into my browser (I'm using both Netscape and Microsoft Internet Explorer). It ought to save me the step by handing the URL off to the browser, or supplying an embedded browser of its own. On the receiving end, I have to save the site I'm browsing and then switch to askSam to import it. I'd really like to see this work as a one-step process. For instance, askSam could provide a utility that hooks onto the browser and appends the Web site I'm browsing into an askSam database.
Another recent entry is LivePage from a company called Inforium, the Information Atrium Inc. Some well-known computer scientists at the University of Waterloo, Ontario, including a cofounder of Watcom and a former CEO of Waterloo Maple Software, came together to build a system of open, non-proprietary text and information management software products. They built their system on SGML and SQL standards; it stores SGML documents in popular relational DBMSs such as Oracle, Watcom, Sybase SQL Server, or Microsoft SQL Server. You can also use LivePage Tools to build custom database solutions with Visual Basic, PowerBuilder, and C. In the future, a standards-based approach that takes advantage of existing DBMS products and tools will make a lot of sense to organizations looking to implement text-retrieval applications.


