Managing text as a database resource requires specialized tools and techniques.
Some of the most important data in a business is not kept in records and therefore cannot be put into a conventional database. The data is in text documents such as letters, contracts, manuals, regulations, policies, and procedures. The textbase market is growing at the rate of 50 percent per year, according to a 1991 study by IDC and Delphi Consulting Group. (See Table 1 for a list of text database vendors.) Depending on how you count text and image storage, they were right. Most of the growth they predicted has been in centralized data centers. But nobody thought about the proliferation of Web documents, so this estimate falls short and the growth of text information is accelerating.
A document base is text that comes arranged in a structured format. Usually the documents have a header record that gives some information about the document and the body or actual text. Most documents are hierarchies of this header-body structure. Books break into chapters, chapters into paragraphs, paragraphs into sentences, and finally sentences into words. Manuals break into sections, sub-sections, and sub-sub-sections.
Documents such as manuals may have embedded graphics and illustrations, but I am not going to discuss searching the images. That is a separate article in itself.
There are two basic types of searches: searching the document headers and searching the body of the documents. Much of the initial searching is performed on the headers, so it is important that they all have the same format. A body or text search depends on pattern matching within the text itself.
A search must return a result at a certain level of granularity within the body of the document. This level is usually adjustable. Obviously, searches that return all occurrences of a particular letter or word are useless. The smallest practical unit is a sentence (or enough characters so that you will most likely see all of a sentence). The largest practical unit is the document - simply because the next level of this hierarchy is the whole document base itself. There is nothing like tables and rows in document databases: a row represents a complete fact about the class of entity that its table represents, whereas a sentence or paragraph has meaning only in context.
The header is a record with fixed fields that apply to all of the documents under consideration. It can be searched like a relational database, which is much faster than searching the text. The header information can include almost anything, but the most common fields are the title, author(s), publication date, classification codes, and an abstract or a keyword list. Most products use a comma-separated list of character strings for multi-valued fields such as author(s) and keyword lists, because they already have the code to search strings and because there are just too many ways to encode the data.
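As a rough sketch of a header search, the Python below splits comma-separated multi-valued fields and matches against them. The field names and the sample records are invented for illustration and do not come from any product.

```python
# Hypothetical header records; field names and contents are illustrative only.
headers = [
    {"title": "Chinese Domino Games", "authors": "Lee, Chan",
     "year": 1994, "keywords": "games, dominoes"},
    {"title": "Internal Queues in the Domino Operating System",
     "authors": "Smith", "year": 1996,
     "keywords": "operating systems, queues"},
]

def split_multi(value):
    """Split a comma-separated multi-valued field into clean lowercase items."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]

def search_headers(records, keyword=None, author=None):
    """Return titles whose header matches every criterion that was supplied."""
    hits = []
    for rec in records:
        if keyword is not None and keyword.lower() not in split_multi(rec["keywords"]):
            continue
        if author is not None and author.lower() not in split_multi(rec["authors"]):
            continue
        hits.append(rec["title"])
    return hits

print(search_headers(headers, keyword="games"))
```

Because only the small fixed fields are examined, a search like this never has to touch the body text.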
The headers must be created by a skilled reader, but there have been attempts to automate this process. For example, Oracle's ConText server attempts to perform an automated summarization. The narrower the scope of the textbase, the easier the automation is - but it is not perfect yet.
A keyword list is a list of words or phrases drawn from a pre-defined set of words or phrases that have meaning to the searcher. For example, this article might have a keyword list containing "textbase," "text retrieval," and "document retrieval" in its list. A keyword does not have to appear in the body of the document itself. For example, few satires would include the word "satire" within their texts. Keywords are a powerful search tool, but using them also requires a human being with knowledge of the field to design the vocabulary and make the choice of keywords. The vocabulary will have to be updated on a regular basis, adding, changing, and dropping terms.
Notice that the keyword list approach works for a single document, not for a collection of documents. Another version of this approach is the keyword in context (KWIC) system, which uses a set of keywords and locates the words within phrases from the document. The phrases are usually titles, but they can also be short descriptive sentences from the original document. The old printed versions of KWIC indexes would print the phrases around a gutter to highlight the keyword; thus you might look up the keyword "domino" and see a listing like this:
                       DOMINO Games for Children
               Chinese DOMINO Games
Internal queues in the DOMINO Operating System
                       DOMINOES and Other Stories
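A listing of this kind is easy to generate. The following Python is only a minimal sketch: the function name, the gutter width, and the prefix-matching rule are my own assumptions, not a description of any printed index.

```python
def kwic(titles, keyword, gutter=30):
    """Return KWIC lines with the keyword aligned at a fixed column."""
    lines = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            # Prefix match, so "domino" also finds "Dominoes".
            if word.lower().startswith(keyword.lower()):
                left = " ".join(words[:i])
                right = " ".join([word.upper()] + words[i + 1:])
                # Right-justify the leading context so keywords line up.
                lines.append(left.rjust(gutter) + " " + right)
    return lines

titles = [
    "Domino Games for Children",
    "Chinese Domino Games",
    "Internal Queues in the Domino Operating System",
    "Dominoes and Other Stories",
]
for line in kwic(titles, "domino"):
    print(line)
```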
Another approach gives each document a general classification code to narrow a search as soon as possible. The Dewey Decimal System and Library of Congress (LOC) classification schemes do that for library books. At least one Web site uses the LOC scheme, but most of the Web search services have their own hierarchical schemes with the option to perform a full search.
If a document uses sub-headers within its sections, the sub-headers become simpler in structure at each lower level. For example, novels have chapters that have only titles (or titles and synopses if it is a 19th-century novel), and paragraphs have no headers. However, a technical manual might break this down further into section and sub-section titles.
Semantic searching is more difficult because the search engine must understand the meaning of the documents and the search question. For example, an ideal semantic text-search engine would have taken my query about "dominoes" and asked if I meant the Domino operating system or the game. When I answered "the game," it would then look at the documents available and determine which ones deal with games. Then it might ask if I was interested in fiction and/or non-fiction.
Even better, the engine would have some machine intelligence to associate things that are not normally indexed together; it would tell someone looking at surgical techniques for damaged livers to consider a magazine article on underwater basket weaving because both procedures use the same methods. Obviously, there are not many working semantic search systems, and the available ones are expensive.
Noise words are words such as "the," which are so common in English (or whatever language) that people would not search on them because every document will have them. These are usually linguistic structuring words such as articles, conjunctions, pronouns, and auxiliary verbs. But the noise word list can also include numerals, punctuation marks, single letters, and words so common to a particular discipline that all documents would include them. If you would like a list of the 1000 most common words in English, you can find them in an appendix in Cryptograms and Spygrams by Norma Gleason (Dover Books, ISBN 0-486-24036-3, 1981). As a rule of thumb, removing the 150 most common words will reduce English text by 60 to 70 percent of the original word count.
Because of the homonym problem, building a noise word list can be harder than you might think. For example, chemical abstracts must use one- and two-letter chemical symbols, so you must be sure to leave "He" for Helium in the documents. But most noise word lists ignore "He," regarding it as the third-person singular pronoun. The best solution would be to parse the text for both words to see the context and from there tell the difference.
Some people think that noise words are a bad idea in spite of the savings they offer in index size and improved performance. In looking for quotations, word structure can be important. The phrase, "To be or not to be, that is the question," has only "question" as a possible search word. Likewise, in "Ask not what your country can do for you, ask what you can do for your country," only "country" is significant.
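The effect on the first quotation can be seen in a few lines of Python. The noise word list here is a tiny made-up sample; a real list would be much longer and tuned to the textbase.

```python
import re

# A tiny illustrative noise word list, not taken from any product.
NOISE_WORDS = {"to", "be", "or", "not", "that", "is", "the", "a", "an", "and", "of"}

def index_terms(text, noise=NOISE_WORDS):
    """Lowercase the text, split it into words, and drop the noise words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in noise]

print(index_terms("To be or not to be, that is the question"))
# prints ['question']
```

Note that this sample list does not contain "he," so a chemical abstract that mentions "He" would keep the symbol; that choice has to be made per textbase.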
Once you pull out the noise words, you can build the index. The index has a certain granularity, which is the level of the document to which it points. The index and retrieval granularity do not have to be the same. If the documents are short abstracts that can be shown on a screen, then the index can use word-level granularity and the retrieval can return results at the document level. If you have used Dialog information services, you have seen this approach. Once you have read the abstracts, you have the option of receiving the whole document in hard copy offline or downloaded online. If the documents are longer, then the retrieval level is usually the paragraph, but it can also be as fine as the sentence.
Index granularity can be as fine as the locations of each search word within the document (usually by byte position within the file) or as coarse as the name of the document in which the word occurs. Word-level granularity makes the index bigger, because each word will probably occur in many places. Another problem is that any change to the document will alter the location of all words that occur after the point at which the change was made. Document-level granularity returns the names of the documents very rapidly, and it is easy to update. However, once you find a list of candidate documents, you must search it in a linear fashion for text patterns. This can take quite some time for long documents. In practice, most systems will keep one index [...] added, then a new page can be created for the overflow, and all page numbers that follow the insertion are incremented. Search problems occur when a search needs to go over page boundaries - a search for "John Smith," where "John" is on one page and "Smith" is on the next page. However, such searches are exceptional and do not affect performance in most cases.
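To make the two extremes of index granularity concrete, here is a minimal Python sketch that builds both a document-level and a word-level index over a pair of made-up documents. The character offset stands in for the byte position, and all names are hypothetical.

```python
import re
from collections import defaultdict

def build_indexes(docs):
    """Build a document-level and a word-level index from {name: text}."""
    doc_index = defaultdict(set)    # word -> names of documents containing it
    word_index = defaultdict(list)  # word -> (document name, character offset)
    for name, text in docs.items():
        for match in re.finditer(r"\w+", text):
            word = match.group().lower()
            doc_index[word].add(name)
            word_index[word].append((name, match.start()))
    return doc_index, word_index

docs = {"a.txt": "John Smith met John Doe", "b.txt": "Smith left"}
doc_index, word_index = build_indexes(docs)
print(sorted(doc_index["smith"]))   # which documents mention the word
print(word_index["john"])           # exactly where the word occurs
```

The word-level index holds one entry per occurrence, which is why it grows faster, and every offset after an edit point goes stale when a document changes.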
The next most complicated operator is a pattern matcher, which uses wildcards. Wildcard conventions vary greatly from product to product, but there is usually a single-character wildcard, such as the "?" in DOS command lines or "_" in SQL's LIKE predicate, and a multi-character wildcard, such as the "*" in DOS command lines or "%" in SQL's LIKE predicate.
The user must know if wildcards match one or zero characters in a word. For example, if a "?" stands for exactly one character, then the pattern "chocolate?" will match the word "chocolates," but not "chocolate."
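The following Python sketch shows one possible convention, assuming that "?" stands for exactly one character and "*" for zero or more characters. The function names are my own, and real products differ on these rules.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '?' (exactly one character) and '*' (zero or more) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(r"\w")
        elif ch == "*":
            parts.append(r"\w*")
        else:
            parts.append(re.escape(ch))  # everything else is a literal
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word fits the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

print(matches("chocolate?", "chocolates"))  # True
print(matches("chocolate?", "chocolate"))   # False
print(matches("chocolate*", "chocolate"))   # True
```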
I have already mentioned automatic synonyms for plurals. Think of a synonym as a replacement of one word by a list separated by ors. For example, the query ("chocolate" and "wife") might expand out in the system as (("chocolate" or "chocolates") and ("wife" or "wives")).
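The expansion can be sketched in Python as a simple lookup. The thesaurus contents and the function names are hypothetical, chosen only to reproduce the example query.

```python
# Hypothetical thesaurus: each word maps to the forms a search should accept.
THESAURUS = {
    "chocolate": ["chocolate", "chocolates"],
    "wife": ["wife", "wives"],
}

def expand(term, thesaurus=THESAURUS):
    """Return the term as a parenthesized list of alternatives joined by 'or'."""
    forms = thesaurus.get(term, [term])
    return "(" + " or ".join(f'"{form}"' for form in forms) + ")"

def expand_query(terms):
    """Expand each term and join the groups with 'and'."""
    return "(" + " and ".join(expand(term) for term in terms) + ")"

print(expand_query(["chocolate", "wife"]))
# prints (("chocolate" or "chocolates") and ("wife" or "wives"))
```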
There are three major types of synonyms. A grammatical synonym is a different form of the same word. For example, "be" has grammatical synonyms "am, is, are, been, was, and were." Most textbases will have irregular plural forms in their thesaurus. Many automatically recognize grammatical suffixes, such as "-s," "-es," "-ing," and "-ily," and they can store rules about doubling the final consonant or dropping the final "e" when forming plurals. This feature is also found in spelling checkers in word processors.
The grammatical forms are often automatic and generated by a parsing program. User-controlled synonyms are kept in a thesaurus file whose file format will differ from product to product. True synonyms are semantically equal, such as "boat," "ship," "craft," and "vessel," and in most cases they can all be swapped for each other in the same context. Narrow synonyms are specific cases of the concept, such as "Europe" narrowing down to countries such as "France," "Germany," and "Italy." Broader synonyms are more general cases of the concept, such as "Eastern Indians" or just "Indians" as wider cases of "Cherokees" in a particular search context.
Synonyms for dates and numbers are a special problem. They are true synonyms, but their forms are very different and can be homonyms for other concepts. For example, "one," "1," and "I" are the English word, Hindu-Arabic numeral, and Roman numeral for the same concept. The Roman numeral is also a homonym for the first-person pronoun. But this list does not show 1.0, 1.00, 1.000, and so forth. The English word "pi," the Greek letter "π," and the numbers 3.14, 3.1415, and 3.14159 are all synonyms. But just how many decimal places should the thesaurus store? The trailing digits cannot be detected with a parsing program.
Proximity operators look for words within a certain distance of each other, where distance is measured by a word count, sentence count, paragraph count, or other structural unit. The usual scope is the whole document, but some products let users set up the borders of their searches to the text between a start word and a finish word (as in find the name "Heathcliff" between "Dear" and "Sincerely" in a textbase of love letters).
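A word-count proximity test might look like the Python below. This is a minimal sketch that assumes distance is the difference in word positions and that order does not matter; the function name and the sample letter are made up.

```python
def within(text, first, second, max_words):
    """True if 'first' and 'second' occur within max_words words of each other."""
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    # Compare every occurrence of one word with every occurrence of the other.
    return any(abs(i - j) <= max_words for i in first_pos for j in second_pos)

letter = "Dear Heathcliff, I miss the moors. Sincerely, Cathy"
print(within(letter, "Heathcliff", "moors", 4))  # True
print(within(letter, "Heathcliff", "Cathy", 3))  # False
```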
The quorum operator looks for documents with (k) out of a list of (n) words or patterns, as in "2 of [German, French, Spanish, Czech]." The quorum operator is often used to restrict a list of synonyms, but it is a shorthand for a Boolean expression.
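Here is a small Python sketch of the quorum operator, assuming that "k of n" means at least k of the listed terms. It also spells out the Boolean expression for which the operator is shorthand; the function names are hypothetical.

```python
from itertools import combinations

def quorum(k, terms, doc_words):
    """True if at least k of the terms appear in the document's word set."""
    return sum(1 for term in terms if term in doc_words) >= k

def quorum_as_boolean(k, terms):
    """Spell out the equivalent Boolean expression as a string."""
    groups = ["(" + " AND ".join(combo) + ")" for combo in combinations(terms, k)]
    return " OR ".join(groups)

languages = ["german", "french", "spanish", "czech"]
doc_words = {"german", "spanish", "wine"}
print(quorum(2, languages, doc_words))  # True
print(quorum_as_boolean(2, ["A", "B", "C"]))
# prints (A AND B) OR (A AND C) OR (B AND C)
```

The expanded form grows quickly with n, which is the practical reason to offer the shorthand.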
Weighted searches assign a score that attempts to measure how well a document fits the query. This process can be very useful when searching a large textbase; it can also be disastrous if the weighting function is flawed.
There are several types of word scoring schemes. The simplest method is a tally of the presence or absence of query words from a document. No special weight is assigned to one term over another. The second method is to report the number of occurrences of each word or pattern. It is assumed that the documents with the most hits are the best ones. A trick used on the Internet to raise the score of a Web site in search engines with this approach is to fill a comment field with repetitions of a few key words.
In a mixed strategy, each word or pattern gets a weight, which is multiplied by the number of occurrences, to give a score. This strategy is a little harder to implement, because you must assign weights somehow. The second method is really a special case of this, with a weight of one for each search term.
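The three schemes described so far can be compared side by side in Python. The weights and the sample document are invented for illustration; this is a sketch, not any product's scoring function.

```python
def presence_score(doc_words, terms):
    """First method: count how many query terms appear at all."""
    return sum(1 for term in terms if term in doc_words)

def weighted_score(doc_words, weights):
    """Mixed strategy: weight times number of occurrences, summed over terms."""
    return sum(weight * doc_words.count(term) for term, weight in weights.items())

def occurrence_score(doc_words, terms):
    """Second method: the mixed strategy with a weight of one for each term."""
    return weighted_score(doc_words, {term: 1 for term in terms})

doc = "chocolate cake with chocolate icing and cream".split()
print(presence_score(doc, ["chocolate", "cream"]))        # 2
print(occurrence_score(doc, ["chocolate", "cream"]))      # 3
print(weighted_score(doc, {"chocolate": 3, "cream": 1}))  # 7
```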
The fourth method is to have a semantic tree structure that assigns heavier weight to more specific synonyms. This means that the thesaurus must know the difference between broader and narrower terms. A search for "Southwest*Indian?" would give more points to documents containing names of particular tribes ("Hopi," "Zuni," or "Navajo"). A document with "Amerind" or "Native American" would get a few points, but a document on Slavic peoples would score nothing. Again, these term weights can be multiplied by the number of occurrences to give a final score.
More elaborate weighting schemes will give points for proximity, order of the search terms in the document, and statistical distribution within the document. In short, they are trying to "read and understand" the document in a crude way.
Borland International distributes a grep( ) with its compiler products that is modeled after the original Unix version. The command line syntax is GREP [-<switches>] <pattern> <files>, but you can ignore the switches that change the output format.
The <pattern> is a regular expression, which is defined as one or more occurrences of the following characters optionally enclosed in quotes. The following symbols are treated specially:
^ start of line
. match any single character
* match zero or more occurrences of preceding character
$ end of line
\ next character is a literal, not a symbol
+ match one or more occurrences of preceding character
[ ] match any character from the enclosed list
[^] match any character not from the enclosed list
- all characters between left and right characters in ASCII order; for example, [0-9] is short for [0123456789]
Regular expressions can match a great many character strings. For example, any integer will match to "[0-9]+"; any decimal number to "[0-9]+\.[0-9]*" (the backslash makes the period a literal decimal point instead of a match for any character); any word to "[A-Z]*[a-z]+" (note that capitalization is allowed); and even patterns such as "[^aeiou0-9]" match to any single character but a lowercase vowel or a digit.
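These patterns can be tried out with Python's re module, which supports all of the symbols listed above. In this sketch the decimal point is escaped, because an unescaped "." matches any single character; the constant and function names are my own.

```python
import re

INTEGER = re.compile(r"[0-9]+")
DECIMAL = re.compile(r"[0-9]+\.[0-9]*")   # "\." is a literal decimal point
WORD = re.compile(r"[A-Z]*[a-z]+")
NOT_VOWEL_OR_DIGIT = re.compile(r"[^aeiou0-9]")

def is_match(regex, text):
    """True if the entire text matches the pattern."""
    return regex.fullmatch(text) is not None

print(is_match(INTEGER, "1997"))            # True
print(is_match(DECIMAL, "3.14"))            # True
print(is_match(DECIMAL, "3x14"))            # False
print(is_match(WORD, "Textbase"))           # True
print(is_match(NOT_VOWEL_OR_DIGIT, "a"))    # False
```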
GREP( ) builds a table-driven list from the pattern and very quickly parses words for a match. The patterns can even be optimized for faster parsing using algebraic reductions such as "aa*" = "a+" to improve performance. However, some patterns cannot be written as regular expressions. For example, there is no regular expression for (n) occurrences of "a" followed by the same number of occurrences of "b" in a string.
System-related commands include:
Notice that there is no logon or password command. DELETE can remove a wide range of user-defined system objects but has no power over the documents and indexes themselves. The HELP command you will see later is like the EXPLAIN command, but it applies only to a particular session. The next class of commands modify a session:
Commands that can be used in a session are:
Boolean operators are the usual AND (intersection), OR (union), and NOT (set difference). Prior result sets can be mixed with search patterns in Boolean expressions. The precedence of operators is NOT, AND, and finally OR, which can be changed by use of parentheses.
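Treating each result set as a set of document numbers makes the operators easy to picture. The Python below is a sketch with made-up result sets: AND is set intersection, OR is set union, and NOT is a set difference against either another result set or the whole textbase.

```python
# Hypothetical result sets: document numbers returned by three word searches.
chocolate = {1, 2, 3, 5}
wife = {2, 3, 8}
cake = {3, 9}
all_docs = set(range(1, 11))   # assume a textbase of ten documents

both = chocolate & wife        # chocolate AND wife
either = chocolate | wife      # chocolate OR wife
without = chocolate - wife     # chocolate AND NOT wife

# "chocolate OR wife AND NOT cake" with precedence NOT, AND, OR.
# The parentheses make the evaluation order explicit.
mixed = chocolate | (wife & (all_docs - cake))

print(sorted(both), sorted(either), sorted(without), sorted(mixed))
```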
The ISO 8777 Document Search Language specifies a left to right order of evaluation, but the ANSI language is silent on this. Character patterns are done first, then word patterns, and finally the Boolean operators.
Ranging or limiting operators are for use with numbers only. They are "GT" or ">" (greater than), "LT" or "<" (less than), "GE" or ">=" (greater than or equal to), "LE" or "<=" (less than or equal to), "NE" or "<>" (not equal to), and the hyphen or keyword "TO" for between.
You would think that these products would be easy to use if you know your subject area. Wrong. WestLaw, a service that provides textbases to the legal profession, gives away free time to law school students to get them hooked on the service. This clearly beats the heck out of going to a library and lifting law books for hours - assuming someone else has not already taken them out. WestLaw has found that first-year law students given standard research assignments miss about 20 percent of the documents they need; and of those documents they did retrieve, about 20 percent were not needed.
The most likely place that you will have used a textbase is a Web search service, such as Yahoo, Magellan, Infoseek, Lycos, Excite, Open Text Index, The Electric Library, C|Net, Accufind, AltaVista, Hotbot, Point, A2Z, 100hot websites, and IBM Infomarket, to name a few of the more popular ones. Every type of textbase is represented in these services. After Web-surfing on the engine of your choice in an area that you know well, you will be very impressed with how good a mere plus or minus 20-percent error rate is.
For more information on ANSI and ISO standards:
* National Information Standards Organization (NISO), 4733 Bethesda Ave., Ste. 300, Bethesda, MD 20814; 301-654-2512 or fax 301-654-1721; email: nisohq@cni.org.
* Director of Publications, American National Standards Institute (ANSI), 11 West 42nd St., New York, NY 10036; 212-642-4900 or fax 212-398-0023; http://www.ansi.org.
* Global Engineering Documents Inc.; 7730 Carondlet Ave., Ste. 407, Clayton, MO 63105; 800-854-7179, 314-726-0444, or fax 314-726-6418; http://www.his.com/global.
Table 1. Text database vendors.

Dataflight Software Inc.
2337 Roscomare Rd., Ste. 11
Los Angeles, CA 90077
800-421-8398, 310-471-3414, or fax 310-471-5294
http://www.dataflight.com
info@dataflight.com
Products: Concordance for Windows

Dataware Technologies Inc.
222 Third St., Ste. 3300
Cambridge, MA 02142
617-621-0820 or fax 617-494-0740
http://www.dataware.com
Products: NetAnswer, BRS/Search, Total Recall, Natural Language Object Library, and others

Document Systems Inc.
A division of AMS Services Inc.
6 Wilton Rd., 2nd Fl.
Westport, CT 06881-5146
203-221-8686 or fax 203-221-8691
http://www.iix.com/dsi/default.htm
Products: Docu/Master

Electronic Book Technologies Inc.
One Richmond Square
Providence, RI 02906
401-421-9550 or fax 401-421-9551
http://www.ebt.com
info@ebt.com
Products: DynaBase, DynaBase Web Management System

Excalibur Technologies Corp.
1921 Gallows Rd., Ste. 200
Vienna, VA 22182
703-761-3700 or fax 703-761-1990
http://www.excalib.com
info@excalib.com
Products: RetrievalWare product family, EFS Electronic Filing Software, and EFS WebFile

Folio Corp.
5072 North 300 West
Provo, UT 84604-5652
800-543-6546, 801-229-6700, or fax 801-229-6787
http://www.folio.com
Sales@folio.com
Products: Folio Views product family

Fulcrum Technologies Inc.
785 Carling Ave.
Ottawa, Ontario
Canada K1S 5H4
613-238-1761 or fax 613-238-7695
http://www.fulcrum.com
info@fulcrum.com
Products: SearchServer, SearchBuilder toolkits, Surfboard, and others

IBM Corp.
800-426-3333
http://www.ibm.com
Products: DB2 Text Extender and SearchManager product family

Illustra Information Technologies Inc.
An Informix Software Inc. company
1111 Broadway, 20th Fl.
Oakland, CA 94607
510-652-8000 or fax 510-869-6388
http://www.illustra.com
info@illustra.com
Products: Illustra Server and Text and Web DataBlades

Infodata Systems Inc.
12150 Monument Dr.
Fairfax, VA 22033
800-336-4939, 703-934-5205, or fax 703-934-7154
http://www.infodata.com
info@infodata.com
Products: Inquire/Text and others

Information Dimensions Inc.
6600 Frantz Rd.
Dublin, OH 43017
614-761-7289 or fax 614-761-7290
http://www.idi.oclc.org
Products: Basis product family

InfoSphere LLC
P.O. Box 225
Pleasant Grove, UT 84062
801-221-5902 or fax 801-221-5903
http://www.fiber.net/infosphere
info@proindex.com
Products: ProIndex

Inmagic Inc.
800 West Cummings Park
Woburn, MA 01801
617-938-4442 or fax 617-938-6393
http://www.inmagic.com
inmagic@inmagic.com
Products: DB/Text product family

Lotus Development Corp.
55 Cambridge Pkwy.
Cambridge, MA 02142
617-577-5800
http://www.lotus.com
Products: Lotus Notes product family

Oracle Corp.
500 Oracle Pkwy.
Redwood Shores, CA 94065
800-672-2537, 415-506-7000, or fax 415-506-7200
http://www.oracle.com
Products: ConText

PC DOCS Inc.
25 Burlington Mall Rd., 4th Fl.
Burlington, MA 01803
617-273-3800 or fax 617-272-6988
http://www.pcdocs.com
Products: DOCS Open, DOCS Mobile, and others

Personal Library Software
2400 Research Blvd., Ste. 350
Rockville, MD 20850
301-990-1155 or fax 301-963-9738
http://www.edshow.com/PLS
Products: PLS product family

TextWare Corp.
P.O. Box 3267
Park City, UT 84060
801-645-9600 or fax 801-645-9610
http://www.textware.com
Products: TextWare product family

Thunderstone Software - EPI Inc.
11115 Edgewater Dr.
Cleveland, OH 44102
216-631-8544 or fax 216-281-0828
http://www.thunderstone.com
info@thunderstone.com
Products: Texis, Webinator, and Metamorph

Verity Inc.
894 Ross Dr.
Sunnyvale, CA 94089
408-541-1500 or fax 408-541-1600
http://www.verity.com
Products: Topic product family

ZyLAB International Inc.
19650 Club House Rd., Ste. 106
Gaithersburg, MD 20879
800-544-6339, 301-590-0900, or fax 301-590-0903
http://www.zylab.com
Products: ZyIndex and ZyImage product families