DBMS

DataMind Professional Edition 1.0

By Maurice Frank
DBMS, October 1996

DataMind is a Windows-based data mining system that can analyze databases to reveal patterns and relationships between values in fields.

Data mining is the latest craze sweeping the data warehousing and decision-support arenas. Data mining includes a variety of analytical techniques that promise to reveal previously unknown patterns and relationships in large transactional databases. (For more background on data mining, see Bruce Moxon's article, "Defining Data Mining," DBMS, August 1996, page S11.) The DataMind product family from DataMind Corp., a startup in the rapidly growing data mining market, is a Windows-based program targeted to end users who may be analytical power users but who are not steeped in the arcane mathematical, statistical, or artificial intelligence wizardry associated with the earliest data mining products.

DataMind runs on Windows 3.1, Windows for Workgroups 3.11, and Windows 95. (I did not try installing it on Windows NT.) The product uses Microsoft Excel for its user interface. The DataMind product family consists of three versions. DataMind Solo is designed for users whose data is stored in spreadsheet or text files. DataMind Professional Edition adds support for ODBC data sources, including RDBMSs. Unlike the other two end-user versions, DataMind DataCruncher is designed for use by MIS staff who create and run data mining studies on a server and then share the results with end users running DataMind Solo or Professional on their desktops. I evaluated DataMind Professional Edition using Windows 95 and Excel 7.0.

Start with a Study

In a nutshell, DataMind analyzes your data to gauge the impact of values in certain fields (called input variables) on values in another field (the output variable). The output field usually represents an outcome or business objective. A classic example is to use a field such as credit rating as the output variable and other fields such as income, marital status, education, or even zip code as input variables. The output variable can contain text values, such as "High," "Medium," and "Low," or numeric ratings such as a scale of 1 to 5.

DataMind organizes your work into a study that defines input and output variables, one or more scenarios, and one or more data sources (also referred to as "domains"). A scenario is a collection of input and output variables. For example, one scenario might include all possible input variables, and another might be limited to a small number of variables. A study also includes numerous analytical reports, all of which are stored in a single spreadsheet file. The study and its reports represent an analytical model of your data.

Completing a study involves three procedures: discovery, evaluation, and prediction. The discovery process includes the initial analysis of your data to detect and quantify relationships. The evaluation process compares the derived associations with the actual value in each record's output field to determine how predictable the results really are. The prediction process applies the model to another data set. You will probably perform these steps any number of times to fine-tune the models you build.

Clicking on the DataMind icon launches Excel and starts a wizard that guides you through the creation of a study. After naming your study, you must choose a data source; your options are an ASCII text file, a selected range in an open Excel spreadsheet, an Excel file on disk (possibly exported from another database), or an ODBC data source. Unless your complete database is relatively small, your discovery data set will probably be a sample.

The data wizard's next step shows the field names, their usage, and the number of unique values in each field. Double-clicking on a field name displays these unique values as an indented list below the field name. At this point you typically choose an output field. If you do not, DataMind will proceed, but it will perform a segmentation analysis, a statistical procedure that groups records with similar characteristics. You can also mark fields such as address to be ignored, and you can indicate whether a field is discrete or continuous. If a numeric field such as age contains numerous values, you should probably create groups. DataMind automatically creates five groups when you change a field from discrete to continuous. I could not find a way to revise the group boundaries while in the wizard, but the Scenario Specification dialog, accessible from the Control Center or the menus, lets you revise them later.
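To make the idea of grouping a continuous field concrete, here is a minimal Python sketch that splits a numeric field into five groups. It assumes equal-width groups; how DataMind actually chooses its group boundaries is not documented in this review, and the function names and sample ages are invented for illustration.

```python
def equal_width_groups(values, n_groups=5):
    """Return (low, high) boundaries for n_groups equal-width groups.

    Equal-width grouping is an assumption made for this sketch only.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_groups
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_groups)]


def group_index(value, boundaries):
    """Return the index of the group that contains value."""
    if value < boundaries[0][0]:
        return 0  # clamp values below the first boundary
    for i, (low, high) in enumerate(boundaries):
        if low <= value < high:
            return i
    return len(boundaries) - 1  # the maximum value falls in the last group


# Invented sample data for an "age" field.
ages = [18, 25, 31, 42, 47, 53, 60, 68]
boundaries = equal_width_groups(ages)
groups = [group_index(a, boundaries) for a in ages]
print(boundaries)
print(groups)
```

Revising the boundaries later, as the Scenario Specification dialog allows, would correspond to replacing the `boundaries` list with hand-picked ranges.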

Exploring a DataMind Model

After the wizard has captured all of your study specifications, DataMind grinds through the data to perform the discovery process. When it finishes, DataMind displays a graphical Control Center in a spreadsheet. (See Figure 3.) The Control Center is divided into three vertical panels: one for the study specification; one for the discovery, evaluation, and prediction processing steps; and one for the reports produced by each procedure.

Clicking on a report icon causes DataMind to generate the report in a new sheet within the Excel workbook. The new report sheet becomes active; you can return to the Control Center to view other reports by clicking on the Control Center tab.

A DataMind model includes numerous canned reports. The discovery reports reveal the associations and relationships divined by DataMind. A good starting point is the Discovery Model Summary, which shows each output field's values and the input field values most closely associated with each outcome. (See Figure 4.) Clicking on a "+" icon expands the display to show all input values. A floating toolbar with a single button provides access to additional discovery views that use charts and graphs to display more detail about how each input criterion (a specific field-value pair such as "Gender=Female") affects each output value. Other discovery reports summarize the study specification and the distribution of data used in the study. I found it helpful to study all of these reports to gain a thorough understanding of what DataMind's model is trying to say.
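The kind of association a discovery report quantifies can be illustrated with a simple co-occurrence count. The Python sketch below is not DataMind's algorithm, which this review does not describe. It assumes that a frequency percentage means the share of records with a given outcome that also match a given input criterion, and the records and their values are invented.

```python
from collections import Counter

# Invented sample records; Account_Status plays the role of the output field.
records = [
    {"Gender": "Female", "Income": "High", "Account_Status": "Open"},
    {"Gender": "Female", "Income": "Low", "Account_Status": "Open"},
    {"Gender": "Male", "Income": "High", "Account_Status": "Open"},
    {"Gender": "Male", "Income": "Low", "Account_Status": "Closed"},
]
OUTPUT_FIELD = "Account_Status"

# How many records have each outcome.
outcome_counts = Counter(r[OUTPUT_FIELD] for r in records)

# How many records pair each input criterion (field, value) with each outcome.
pair_counts = Counter(
    ((field, value), r[OUTPUT_FIELD])
    for r in records
    for field, value in r.items()
    if field != OUTPUT_FIELD
)

# Assumed definition: percentage of an outcome's records that match the criterion.
freq_pct = {
    (criterion, outcome): 100.0 * n / outcome_counts[outcome]
    for (criterion, outcome), n in pair_counts.items()
}

for (criterion, outcome), pct in sorted(freq_pct.items()):
    print(f"{criterion[0]}={criterion[1]} with {outcome}: {pct:.1f}%")
```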

The Excel reports are quite colorful and well designed, but you should become familiar with DataMind's terms in order to interpret the results correctly. (For example, the vague term "specific criteria" indicates input variables that are always associated with an output variable.) You can also generate reports into a Microsoft Word document. DataMind summarizes the study variables and the criteria that affect each output variable's values. The Word reports use natural language statements and include some definitions of DataMind's terms.

Evaluation and Prediction

The evaluation process can cross-check the discovery model by comparing each record's actual output field value to the predicted result. If a high percentage of the predictions are accurate, then it is likely that this model will accurately predict the results of another data set. If too many prediction failures occur, you should fine-tune the specification model by using other scenarios or by modifying the current scenario. DataMind's evaluation reports include an Evaluation Summary and Evaluation Profiles, which summarize the success and failure rates for each outcome, as well as the number of unpredictable cases due to either too few inputs or missing data. Another report of "best profiles" shows the ideal input variable values for each output variable value. For example, you might see that a good credit rating is associated with married homeowners with two children and a high income.
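The bookkeeping behind an evaluation summary can be sketched in a few lines of Python. This is only an illustration of comparing predicted values with actual ones; the outcome labels and results are invented, and the way unpredictable cases are marked (`None`) is an assumption of this sketch rather than anything DataMind exposes.

```python
from collections import defaultdict

# (actual outcome, predicted outcome); None marks a case that could not be
# predicted, for example because of too few inputs or missing data.
results = [
    ("Good", "Good"),
    ("Good", "Poor"),
    ("Good", "Good"),
    ("Poor", "Poor"),
    ("Poor", None),
]

summary = defaultdict(lambda: {"success": 0, "failure": 0, "unpredictable": 0})
for actual, predicted in results:
    if predicted is None:
        summary[actual]["unpredictable"] += 1
    elif predicted == actual:
        summary[actual]["success"] += 1
    else:
        summary[actual]["failure"] += 1

for outcome, counts in summary.items():
    predicted_cases = counts["success"] + counts["failure"]
    # Success rate is computed over the cases that received a prediction.
    rate = 100.0 * counts["success"] / predicted_cases if predicted_cases else 0.0
    print(f"{outcome}: {counts}, success rate {rate:.1f}%")
```

A low success rate for one outcome is the signal, described above, that the scenario needs to be modified or replaced.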

You can also run an evaluation process against another dataset. However, specifying another dataset could be easier. When you run the evaluation process, a dialog asks if you want to use the discovery dataset. Answering "no" does not lead to a dataset selection dialog as I expected; you must dig around in other dialogs.

DataMind performs two types of predictions: batch and case. A batch prediction applies the model to a new dataset. A case prediction lets you examine and manipulate each record.

The Batch Prediction Summary report includes a row for each record and columns for each field (variable), plus additional columns indicating the three most likely predicted outcomes. If you have many input fields, this report will be hard to digest. I also had to click on the "+" icon because not all of the report columns were initially visible.

The Case Prediction option uses a form-like dialog that displays one record at a time, with each field and value listed vertically down the window. The most likely prediction is displayed above the fields, and - because this is a drop-down list - you can also see the second and third most likely predictions. The best part is that you can play what-if games in this window by altering values (such as decreasing income levels) to see the impact of the change. A "Why" button displays another dialog that lists each criterion and its impact on the output variable.
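A what-if session of this kind can be mimicked with a toy scoring model. In the Python sketch below, the weights table, field names, and outcome labels are all hypothetical, and the additive scoring is a stand-in chosen for simplicity; the review does not say how DataMind computes its predictions.

```python
# Hypothetical per-criterion weights; a real model would derive these from data.
weights = {
    ("Income", "High"): {"Good": 3, "Fair": 1, "Poor": 0},
    ("Income", "Low"): {"Good": 0, "Fair": 1, "Poor": 3},
    ("Homeowner", "Yes"): {"Good": 2, "Fair": 1, "Poor": 0},
    ("Homeowner", "No"): {"Good": 0, "Fair": 2, "Poor": 1},
}


def rank_outcomes(record, top_n=3):
    """Return up to top_n outcomes, most likely first, by summing weights."""
    scores = {}
    for criterion in record.items():  # each criterion is a (field, value) pair
        for outcome, w in weights.get(criterion, {}).items():
            scores[outcome] = scores.get(outcome, 0) + w
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda o: (-scores[o], o))[:top_n]


case = {"Income": "High", "Homeowner": "Yes"}
baseline = rank_outcomes(case)

# What-if: lower the income and see how the ranking changes.
what_if = rank_outcomes({**case, "Income": "Low"})

print(baseline)
print(what_if)
```

Listing the individual weights that contributed to the top-ranked outcome would give a rough analogue of the "Why" dialog.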

Room for Improvement

Like most version 1.0 releases, DataMind has some room for improvement. This version works only with single tables. It does not yet provide any way to join tables, so you must create queries, views, or extract tables in your source DBMS. (A future version may support joins.) Also, DataMind displays the physical names of columns and lacks a metadata layer of its own, so unless users recognize and understand those names, you will have to rename the existing columns.
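As a rough illustration of the workaround, the Python sketch below uses the standard sqlite3 module to build a single view that joins two tables and gives the columns friendlier names. SQLite merely stands in for whatever source DBMS you use; the table names, column names, and data are invented, and whether quoted aliases with spaces survive a particular ODBC driver is something you would need to verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, income TEXT, marital_status TEXT);
    CREATE TABLE accounts (cust_id INTEGER, credit_rating TEXT);
    INSERT INTO customers VALUES (1, 'High', 'Married'), (2, 'Low', 'Single');
    INSERT INTO accounts VALUES (1, 'Good'), (2, 'Poor');

    -- One flat view: the join is done in the database, and the aliases
    -- replace the physical column names with readable ones.
    CREATE VIEW mining_extract AS
    SELECT c.income         AS "Income",
           c.marital_status AS "Marital Status",
           a.credit_rating  AS "Credit Rating"
    FROM customers c
    JOIN accounts a ON a.cust_id = c.cust_id;
""")

rows = conn.execute('SELECT * FROM mining_extract ORDER BY "Income"').fetchall()
print(rows)
```

A data mining tool limited to single tables could then be pointed at the view instead of the underlying tables.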

The thin user guide - which includes a brief tutorial - explains the basics, but a product that performs a function new to most users should provide more background information and more than one tutorial example. I was unable to find instructions for several tasks such as creating a domain (dataset) in either the manual or the help system.

At press time, DataMind Corp. had plans to release version 1.1 in mid-September. This upgrade will support an unlimited number of discovered relationships for any single output value (version 1.0 can discover only up to 2,000 relationships). It will also import text files delimited with commas, spaces, and any user-defined delimiter (version 1.0 imports only tab-delimited data). Also, DataCruncher will be available on Windows NT and HP-UX 10.x.
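Until other delimiters are supported, a comma-delimited file can be converted to tab-delimited form before import. The following Python sketch uses the standard csv module; the function name and the sample data are invented for illustration, and in-memory text buffers stand in for real files.

```python
import csv
import io


def commas_to_tabs(src, dst):
    """Copy comma-delimited rows from src to dst as tab-delimited rows."""
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Invented sample data; note the quoted field that contains a comma.
source = io.StringIO('name,city,rating\n"Smith, J.",Marietta,Good\n')
target = io.StringIO()
commas_to_tabs(source, target)
print(target.getvalue())
```

With real files, you would pass file objects opened with `newline=""` in place of the `StringIO` buffers. Fields that themselves contain tabs would be quoted in the output, which the importing program may or may not accept.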

Start Digging with DataMind

If you want to introduce end user analysts to data mining techniques, DataMind is a good program to start with. It is functional enough to be worthwhile, although I would not say that it is mature yet. The Excel interface is polished, attractive, and simple to use for most tasks, but you should make sure your users understand DataMind's terms and how to interpret its reports. When used properly, DataMind can reveal business insights that are hard to obtain using other kinds of analytical software.


FIGURE 3

The DataMind Control Center appears in an Excel sheet after the Wizard completes the discovery process. Clicking on the Scenario Specification icon lets you revise the study. Clicking on one of the icons in the middle panel initiates a discovery, evaluation, or prediction process. The icons on the right panel display reports that DataMind creates in other sheets within the Excel workbook.


FIGURE 4

The Discovery Model Summary report quantifies the relationships between the input variables and the output field (Account_Status) values. "Yes" in the Required column means that the input value is always associated with the output value. The Freq. % column tells how often an input criterion occurs with an output criterion. Impact ranks the relative importance of each input criterion.

Maurice Frank is DBMS's editor, based in Marietta, Georgia. You can email Maurice at mfrank@mfi.com.
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Updated Monday, September 23, 1996