DBMS

Predicting Credit Risk


DBMS, Data Mining Solutions Supplement 1998

One challenge data miners face is choosing which technique to apply to an analysis task. In many situations, analysts can use a variety of techniques, but each technique conducts the analysis and represents the results differently. For consistency, we will use a single example to illustrate the data mining techniques explained in the other articles in this publication.

Consider a financial institution such as a bank or a credit union that seeks to minimize loan defaults. Loan officers must be able to identify potential credit risks during the loan approval cycle. The problem is one of simple classification: predicting whether an applicant will be a good or a poor credit risk.

Table 1 shows the training dataset for this problem. For simplicity, the size has been limited to five records, a very small dataset.

Name    Debt    Income    Married?    Risk
Joe     High    High      Yes         Good
Sue     Low     High      Yes         Good
John    Low     High      No          Poor
Mary    High    Low       Yes         Poor
Fred    Low     Low       Yes         Poor

Table 1. Credit risk training dataset. Debt, income, and marital status are the independent variables. Credit risk is the dependent variable or the outcome.

This dataset contains information about people to whom the institution previously loaned money. The lender determined if each applicant was a good or poor credit risk. Because Risk is what we wish to predict, it is the dependent column, also called the target variable. The other columns used to build the model are known as independent columns. In this example they include debt level (High or Low), income level (High or Low), and marital status (Yes or No). The Name column will be ignored because it is highly unlikely that a person's name affects his credit risk. Even if it were relevant, it would not be used because it is unique (or almost so). The same would be true of record IDs or other identifiers such as social security numbers.
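To make the roles of the columns concrete, here is a minimal Python sketch of Table 1. It is only an illustration, not part of any data mining product discussed in this supplement. The names TRAINING_DATA, TARGET, IGNORED, and split_features_target are hypothetical, chosen for this example.

```python
# Table 1 as a list of records; the values are copied from the article.
TRAINING_DATA = [
    {"Name": "Joe",  "Debt": "High", "Income": "High", "Married?": "Yes", "Risk": "Good"},
    {"Name": "Sue",  "Debt": "Low",  "Income": "High", "Married?": "Yes", "Risk": "Good"},
    {"Name": "John", "Debt": "Low",  "Income": "High", "Married?": "No",  "Risk": "Poor"},
    {"Name": "Mary", "Debt": "High", "Income": "Low",  "Married?": "Yes", "Risk": "Poor"},
    {"Name": "Fred", "Debt": "Low",  "Income": "Low",  "Married?": "Yes", "Risk": "Poor"},
]

TARGET = "Risk"     # dependent column (hypothetical constant name)
IGNORED = {"Name"}  # identifier columns that are left out of the model


def split_features_target(records, target=TARGET, ignored=IGNORED):
    """Separate the independent columns from the dependent column.

    Returns (features, targets): a list of dicts holding only the
    independent columns, and a parallel list of target values.
    """
    features = [
        {col: val for col, val in rec.items()
         if col != target and col not in ignored}
        for rec in records
    ]
    targets = [rec[target] for rec in records]
    return features, targets


features, targets = split_features_target(TRAINING_DATA)
```

After this call, features holds only Debt, Income, and Married? for each person, and targets holds the corresponding Risk values in the same order.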

In our example, all the columns, except Name, have two possible values. (As with the record limitation, the restriction to two values is only to keep the example simple.) The term categorical refers to columns that can only contain a limited set of predefined values. In contrast, a continuous value is one that can be any value within a continuum. If income were recorded in whole dollar amounts ranging from $0 to $100,000, that would be a continuous value. As we will see, many data mining problems require binning or grouping continuous values into categorical values. Income categories could be $0 to $10,000, $10,001 to $20,000, and so forth.
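The binning just described can be sketched in a few lines of Python. This is one possible approach under stated assumptions: incomes are whole dollars between $0 and $100,000, and bins are $10,000 wide with the boundaries given above. The function name bin_income and its parameters are hypothetical.

```python
def bin_income(income, width=10_000, maximum=100_000):
    """Map a whole-dollar income to a categorical range label.

    Assumes the bins described in the text: $0 to $10,000,
    $10,001 to $20,000, and so forth up to `maximum`.
    """
    if not 0 <= income <= maximum:
        raise ValueError(f"income must be between 0 and {maximum}")
    if income <= width:
        # The first bin is the only one that starts at a round number.
        return f"$0 to ${width:,}"
    # Later bins start one dollar above a multiple of `width`.
    index = (income - 1) // width
    low = index * width + 1
    high = (index + 1) * width
    return f"${low:,} to ${high:,}"


# Example incomes (made up for illustration, not from Table 1).
labels = [bin_income(x) for x in (0, 10_000, 10_001, 35_500, 100_000)]
```

The resulting labels can then be treated like any other categorical column, in the same way as the High and Low income levels in Table 1.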

The training data is a sample of the complete dataset. Additional datasets for test, control and validation would have the same columns.


Estelle Brand (estelle@xore.com) and Rob Gerritsen (rob@xore.com) are founders of Exclusive Ore Inc., based in Blue Bell, Pennsylvania, which is a consulting and training company specializing in data mining. During the last two years they have used more than a dozen data mining products. Their database management systems experience dates back to the dark ages. For more information about Exclusive Ore and data mining, see www.xore.com.

DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Updated February 26, 1998