DBMS
 

 


Looking for Data in All the Wrong Places: Strategies for Searching and Exracting Information from Free-Form Data Sources. By Peter L. Brooks
DBMS, October 1997

Data dumpsters. Data cemeteries. C'mon, admit it -- you've heard these terms used to describe information stored in your organization's databases. The amount of data stored in databases has increased dramatically, frustrating users. In addition, there is an enormous amount of information available from new sources -- the Internet, intranets, and extranets -- where it is also difficult to find what you are looking for. Microsoft Word and Lotus Word Pro documents are becoming an informal corporate information exchange standard. With all these information sources, what can be done to help users find the proverbial "needle in the haystack when they are looking for free-form information from database fields, Web pages, or documents? New search technologies are one answer to this question. These technologies allow users to query databases, Web pages, and text documents using specific search criteria to extract the information that is most important while minimizing the amount of extraneous information returned in the search. This article will describe the newer search technologies and techniques that are available to perform ad hoc searching and extraction of information from free-form information sources.

The Need

Searching through exploding amounts of textual data has become difficult, even with the help of ad hoc query and reporting tools. The following factors cause problems:

Search Technology Roundup

Several types of search products are attempting to overcome these issues. In this article, I will focus on products in the following categories:

The technologies in these categories have searching abilities that overcome the limitations of SQL and exact keyword searching. They also increase searching effectiveness by using context-sensitive searching, synonyms, word-root rules, and scoring criteria. Furthermore, their searching abilities increase the effectiveness of search results by ranking them in order of relevancy with the closest matches first on the list. These technologies access information that exists in databases, Web pages, and PC text file documents.

Natural Language Querying

One approach to overcoming the need for users to master the complexities of SQL is to provide a natural language interface to the SQL language. English Wizard by Linguistic Technology Corp. is an example of this type of product. Dr. Larry Harris, chairman of Linguistic Technology, developed English Wizard as the successor to AI Corp.'s mainframe natural language product called Intellect, which he also developed.

English Wizard uses a semantic layer to translate English questions into SQL queries. The semantic layer uses a proprietary dictionary, built initially from database metadata, to define the relationship of English words to information in a database. Besides a basic English vocabulary, the English Wizard dictionary includes or determines from the target database a thesaurus, special date support, database definitions, and join logic. You can use the Dictionary Editor to modify the dictionary by adding business-specific phrases or restricting database views, for example.

Dictionaries can exist in a hierarchy so that there can be an administrator dictionary as well as ones for different subject areas or individual users. Changes in an administrator dictionary are automatically propagated to others lower in the hierarchy.

Users express database queries in English, which are then translated to SQL using the dictionary and passed to a database. If the dictionary cannot resolve the query, English Wizard prompts the user for clarification. Once resolved, clarifications are automatically added to the dictionary.

The English Wizard user interface can be directly accessed by users or installed as an add-in to ODBC-compliant desktop reporting tools such as Microsoft Corp. Access, Seagate Software Information Management Group's Crystal Reports, Platinum Technology Inc.'s Forest & Trees, and Sybase Inc.'s InfoMaker. Results can be displayed in a spreadsheet workbook or an ODBC-compliant desktop reporting tool. Alternatively, a software developer's kit is available to embed English Wizard technology in custom applications. APIs are provided to translate the end-user or application-developed natural query language request into Microsoft Access, Oracle Corp., Informix Software Inc., or ODBC (for IBM DB2 and other relational databases) versions of SQL for application processing. The English Wizard supports OCX, VBX, Powersoft (a subsidiary of Sybase Inc.) PowerBuilder User Object, Borland International Inc. Delphi Control, and Informix NewEra class libraries.

Inverted Indexing

Using inverted indices when searching text fields can help you overcome some of the traditional limitations of relational database B-tree indexing. Relational database B-tree index searching is most efficient when performing exact character matching in a left-to-right sequence. A weakness of B-tree indexing -- one that will often cause performance problems -- is found when you perform generalized keyword searching against multiple free-form text fields with the search target in a random position within the data field. For example, searching for names using a B-tree index works well when names are found in "Last Name" and "First Name" columns. B-tree indexing does not work well when a name can be found in any position within a field, particularly a free-form field such as an address or comment field. A table scan is often required even if the exact text was known. Note that bitmap indices, although they provide improved performance over B-tree indices, are not generally appropriate for high-cardinality text data.

An inverted index stores each occurrence of a column value (or in the case of text, each individual word in a column) in the index with a pointer to the associated database rows. Searching is performed against the index, yielding significantly better performance than B-tree indices or base-table scans for the types of queries described previously.

Dynamic Information Systems Corp.'s Omnidex is an example of a product that uses an inverted index to reduce the time required for keyword searches. Omnidex indexes each word in the selected relational database column(s) into an inverted index. Extraneous words such as "a" or "the" are usually excluded.

From a user's perspective, Omnidex's inverted indexing scheme has search capabilities beyond those of standard SQL. These include nonpositional keyword searches without requiring a table scan, case insensitivity, and the ability to use Soundex for phonetic searches rather than requiring exact spelling. Names with similar spellings such as "Robin" and "Robyn" would be retrieved from one query. Omnidex also provides index flexibility. For example, several columns can be grouped into one inverted index. This approach is beneficial for columns such as ADDRESS_LINE1, ADDRESS_LINE2, ADDRESS_LINE3 that could all have similar street address information. Alternatively, only a portion of a column needs to be included in the index.

Indices can be updated in real time as data changes, asynchronously in background mode using application transaction log records, or using a periodic batch update/rebuild process. Application API routines -- such as OAINSERT, OAUPDATE, OADELETE, OAINSERTINDEX, OAUPDATEINDEX, and OADELETEINDEX -- are used to manipulate the indices.

Inverted indices are most appropriate for columns of unstructured, free-form textual data in large databases. They are most efficient when queries do not return a large proportion of the records -- otherwise a table scan would be just as efficient. Inverted indexes provide significant improvement in end-user query performance, but they require a potentially large amount of extra storage, system resources, and index update time.

Internet Search Engines

Most readers will be familiar with one or more of the numerous Internet search engines that let them search the large volume of data on the Internet. Such engines are examples of products that use indices to provide search capabilities. Most of these products work by having Web "spiders" search Internet Web pages and discussion groups, identify the key words, phrases, or concepts in each page, and then automatically add the words/phrases/concepts and Web page URL to a massive virtual index. The index is called "virtual" because it may exist on one server or may be distributed across many servers. Although its end result is generally the same, Yahoo (www.yahoo.com) is an example of a search engine that uses humans rather than Web spiders to review, categorize, and index Web pages. Regardless of the development process, the massive indices are searched when users enter keywords into the product's search data-entry field.

From a search perspective, Internet search engines generally extend the capability of keyword searching by using concept and content searching. Concept searching includes linguistic rules in the search criteria to expand the scope of a particular search. For example, a search using the keyword "pants" may be expanded to include the concepts "clothes" and "jeans." The other words are related concepts, not synonyms. Context also needs to be considered so that such concepts are not related to "pants" if "pants" is being used as a verb (as in gasps or breathes quickly) rather than a noun.

Internet search engines often improve search retrieval results by ranking information by relevance. Relevance is typically obtained by analyzing results using statistical techniques, such as word frequency within the retrieved data, proximity of keywords with each other, and positioning of the keywords within the data.

Many Internet search engine vendors are expanding their services to corporate intranets, databases, and text documents. Search engine technology can often be embedded in applications developed by other vendors. I use Verity Inc.'s Search '97 and Excalibur Technologies Corp.'s RetrievalWare as examples of Internet search engine products.

Verity's Search '97 is designed to allow the indexing and tracking of information from Web pages, discussion groups, email, relational databases, and other data sources. Search '97 supports both the "pull" and "push" models of information distribution.

Search '97 Architecture

Verity's Search '97 architecture consists of several key components:

Verity is integrating database-access technology from 64k (which Verity recently acquired) into Search '97 in the agent-server layer to provide relational database access with improved functionality and performance. Internet/intranet Web page and text access will continue to be handled by Verity's existing product. 64k's relational database relevance ranking will be combined with Verity's text capabilities in this area. ODBC will be used to access databases.

A key feature in 64k's DBGuide product is the ability to show a summary of the entire database contents, or a subset, as the start of an iterative search process. Users do not have to pose a number of queries in order to find out what is in the database. A fuzzy query-by-example capability enables you to select a record and tell DBGuide to "show me records similar to the one I chose."

DBGuide is based on use of patent-pending "Intelligent Index" technology. The index is built from a relational database schema -- a system administrator need only specify the database location and index update frequency for the index to be created. Advanced datatypes such as phone number, ZIP code, and URL will be provided.

RetrievalWare Architecture

Excalibur's RetrievalWare architecture consists of text retrieval and profiling servers, visual media servers, and a Web server. Excalibur uses its Adaptive Pattern Recognition Processing (APRP) and semantic network technologies to perform data capture, indexing, searching, and retrieval. Use of the APRP search capability is based on neural network-like models rather than specific keywords. Indices are created using the binary patterns in digital text data. One advantage of this technique is that it eliminates the need to define keywords manually; another is that it supports fuzzy searching.

Excalibur's semantic network performs concept searching based on the meanings and relationships of words in dictionaries and thesauri. Users are allowed to select the meanings of words used in queries and display the meaning of words that were used to determine relevance. A reported 400,000 concepts and 1.6 million words are included.

Database Text Searching

The application extendibility and flexibility of object-relational databases allows for significantly improved search capabilities for text stored within the database. Key functionality of the Personal Library Software (PLS) Personal Librarian product has been incorporated into the Informix and PLS Text DataBlade for Informix's Universal Server. IBM's Text Extender searches against information in DB2 databases. Thunderstone Software's relational database was built for Internet/intranet text storage and retrieval applications.

The PLS Text DataBlade is based on a Callable Personal Librarian (CPL) module and set of object-based C library APIs. The CPL performs text processing, querying, relevance ranking, and text retrieval. Concept searching is performed by dynamically expanding queries to include terms related to those in the query. Fundamental to the architecture of the PLS Text DataBlade is an inverted index that cross-references index terms with the database rows in which the terms occur. In the PLS index, a dictionary is used to maintain the index structure. The dictionary is created by:

Indices are related to columns and created via a CREATE INDEX...USING pls(columns) command that adds text terms to the index. Once created, changes to the indexed database columns by any Universal Server command will be reflected in the index. The following datatypes are supported:

To provide relevance rankings, a PlsOidRank_t datatype is created using the CREATE TYPE PlsOidRank_t(pls_oid oid, pls_rank integer) command during the PLS Text DataBlade installation. The LIBMI interface is used to create operator classes for the textual datatypes.

IBM's DB2 Text Extender lets you perform linguistic searches against textual data that is either stored in DB2 or that is managed by DB2 but exists in common file or PC document formats. Wildcard, proximity, linguistic, and stem searching are examples of the available search capabilities.

SQL can be used to perform the text searching. To find the words "Internet," "text," and "search" (or its synonyms) in the same paragraph of text indexed in the DB2COL column of the DB2 table called "TextTable," the following SQL could be used:

SELECT * FROM TextTable WHERE DB2TX.CONTAINS (DB2COL, "internet" IN SAME PARAGRAPH AS "text" AND SYNONYM FORM OF "search") = 1

(DB2X.CONTAINS is a DB2 Text Extender search function.)

You can create three types of indices against a text column:

Precise indices are best used when you are looking for exact matches. This is generally the fastest type of searching. Precise searching for the word "leaf" will result only in matches for that word, not variations such as "leaves." Masks and wildcards can be used. The key benefit of linguistic indexing is that search term variations are matched automatically. Searching for "leaf" will return documents containing both "leaf" and "leaves." Dual indexing allows for both precise and linguistic searching. Dual indexing requires the most storage. Indices can be created on a column-by-column basis or globally for all text columns in a database.

The DB2 Text Extender can be installed in a shared address space with DB2 for maximum performance or in its own address space. The Text Extender is part of the code base that will be merged into IBM's DB2 Universal Database. The product supports 17 languages.

Thunderstone's Texis relational database uses what it calls a "coarse-grained inverted index" to enable searching of textual information stored either within the database or in external files (which therefore use the INDIRECT datatype). The index is created with the sql CREATE INDEX command and typically requires 10 to 15 percent additional storage. To improve performance, you can place indices on storage devices that are separate from the underlying data.

Although users can access Texis directly, the database is most often accessed by programs that use SQL with several proprietary keywords. Truncation, wild cards, proximity fuzzy matching, and concept searching capabilities are available. Results can be ranked by relevance. Query results can be saved in temporary tables for iterative searching.

A key SQL like clause extension is the ability to include the functionality of Thunderstone's Metamorph search engine by supporting Metamorph syntax. Thus, SELECT AUTHOR_NAME WHERE BODY_TEXT LIKE "text internet w/para" will retrieve the author of all articles where the words "text" and "internet" (case insensitive) are in the same paragraph within the body_ text column. The "w/" is the Metamorph proximity keyword. Other proximity keyword parameters can be used to restrict proximity to be within a sentence or a specific number of words. Synonym capability can be enabled as well.

The LIKER SQL extension keyword will return data in order of relevance based on the presence of the requested terms. likep will use proximity searching.

Much of the Texis database text search infrastructure is customizable: the 250,000 word thesaurus, the stop-word list (noise words that need not be indexed), and the "concepts" database that are distributed with the product can all be updated by users. Internet and intranet applications can be created by using Texis's Webscript language.

Other Database Searching

Object Design Inc. has developed the ObjectStore Text Object Manager -- based on Verity's Topic Search Engine and Topic Developer Kit -- to provide keyword matching, concept searching, relevancy ranking, and the other Verity text searching capabilities. Text documents are cataloged and stored in their native formats and can be encapsulated with attributes such as author, revision date, topic, and content type. Additional attributes can be added as needed.

Dataware Technology Inc.'s InfoMagnet technology filters information from numerous documents and other formats, including Lotus Notes. MagnetReader dynamically reads, indexes, and filters realtime data for distribution to users. MagnetSweeper reads, indexes, and filters static Web or file server data. You can customize InfoMagnet with HTML and JavaScript; a natural language user interface is also included. InfoMagnet supports keyword weighting, concept searching, and relevance ranking techniques. Users can indicate which retrieved documents are relevant and which ones are not so that InfoMagnet can incorporate a finer degree of differentiation into the user's profile.

Databases are being used to store digital data that is far more complicated to search and retrieve than text and numbers. Database vendors and third parties are now able to support searching of digital content itself, rather than being restricted to textual search terms associated with each content instance.

Informix and Virage's Visual Intelligence Retrieval DataBlade performs visual and textual searching of the Informix Universal Database. The Visual Intelligence Viewer (VIV) is the graphical viewer companion DataBlade. To perform a search using the VIV, you identify a template of the graphic that is to be searched for, and then use slider controls to indicate the importance of graphic characteristics in the search. You can perform searching of both graphic images and associated text fields. The popular image formats are supported, and text, audio, video, and other search DataBlades are available.

IBM's DB2 Image Extender provides an image datatype and functions that support images up to DB2 version 2's maximum size of up to two gigabytes for large objects. The Query by Image Content (QBIC) capability allows for the searching of images containing particular colors or textures. Queries can include both image and textual information associated with images so that, for example, pictures that include the color "red" and that were photographed before 1996 could be selected. Image formats such as GIF, JPEG, BMP, and TIFF are supported; DB2 text, audio, video, and other search extenders are available.

Visual RetrievalWare is Excalibur's digital image search and retrieval product. Shape, color, and texture information about images are obtained and then added to a searchable index that is typically 1 to 10 percent of the original image size. A software developer's kit provides a "C" API programmatic interface to Visual RetrievalWare, including the ability to condense images into what is called a feature vector that can be indexed and searched using literal image content matching. There are two related databases: an index with the feature vectors and a database of actual images. VBXs and relational database DLLs are provided on Windows 95 and NT platforms. Excalibur has created a face recognition Informix Universal Server DataBlade module that allows matching of face patterns. You can use a SQL query interface to query the images.

Tradeoffs

Natural language querying, inverted indexing, Internet search engines, and database text searching functionality are beneficial in many situations, however there are considerations that should be understood as well. First, don't rule out SQL queries -- providing end users with a point-and-click SQL front-end tool may satisfy their reporting needs. The search technologies that I reviewed here are best for searching through large amounts of free-form data. If the target data is highly structured, you may not need advanced search capabilities at all.

Natural language querying, although conceptually appealing, has never really caught on, probably because the complexity of the English language has not allowed the technology to achieve 100-percent query accuracy. In addition, the knowledge and time required in administrative setup and translation of database terminology to English should not be underestimated. The latest generation of natural language query tools, such as English Wizard, are designed to overcome these inhibitors.

Inverted indexing can lead to significantly improved query performance. However, this improved performance is at the expense of the disk space required to store the index and the extra processing time required to update the index, either in real time or in batch mode.

Internet search engines currently work best against Internet or intranet Web pages and discussion groups. Vendors are beginning to support the searching of corporate databases but there is not yet a lot of user interface, indexing, and search performance experience in using Internet search engines to query relational databases.

Similarly, there is not much history of searching text or the various digital content that is available in databases. However, the structure of universal databases and the capability to create DataBlades or Extenders seems to be well-suited to advanced searching capabilities, particularly of text objects.

Beyond SQL

As the number and size of data sources increase, it is becoming difficult for users to perform effective ad hoc querying and searching. Several technologies provide capabilities far beyond those of SQL keyword and character string searching: natural language querying, inverted indexing, Internet search engines, and universal database searching.

Products using these technologies typically include the use of a dictionary, thesaurus, semantic rules concept searching, high-performance indexing, and relevance ranking functions that SQL simply does not provide. Nondatabase sources, such as Web pages and Microsoft Word documents, are supported.

When end users can't effectively retrieve information that exists within your organization, it is time for you to consider implementing some form of search technology. It will save you and your end users significant time, and it will open up potentially priceless information that would not be available by other means.


Peter Brooks, based in Boston, is a management consultant with Coopers & Lybrand Consulting's Integrated Strategic Services organization. He specializes in helping organizations expand strategically and competitively through the application of business intelligence systems, data warehouses, and Internet/intranet technology. You can email Peter at plbrooks@compuserve.com.


What did you think of this article? Send a letter to the editor.


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
October 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated September 18, 1997