DBMS
 

 


Visualizing Data - Sophisticated Graphic Visualization and Development Tools Tailored for Business Applications - By Peter L. Brooks

Our ability to understand data and make business decisions based on it is being undermined by the sheer growth in both the breadth and depth of data in most organizations. New advanced data visualization techniques, often in conjunction with OLAP/ROLAP and data mining tools and processes, are being used to make sense of the enormous amounts of data that are now available for analysis.

Data visualization is the use of graphics to make sense of the reams of data that are available for analysis and decision making. In its simplest form, data visualization is in the form of bar, line, or pie charts. Spreadsheets and OLAP tools are examples of products with simple to moderate data visualization capabilities that are commonly used in business. On the other end of the spectrum are sophisticated scientific applications such as medical imaging or fluid dynamics that require complicated 3D visuals.

The development of these types of graphics requires specialized application and technical knowledge. This article describes the use of a new class of sophisticated graphic visualization and development tools that are being tailored for business applications.

As the need to understand and analyze information increases, the need to explore data advances beyond simple graphics. Advanced data visualization tools, previously used in niche market applications, are gaining the ability to support complex business data exploration and analysis needs. Simple data visualization tools require analysts to view several charts or spreadsheets sequentially to identify complex, multidimensional data relationships of interest. OLAP tools allow a more sophisticated two- and sometimes three-dimensional view of data using a spreadsheet and graphics. Advanced data visualization tools enable analysts to explore complex, multidimensional data in one screen. The promise of these tools is to allow users to explore three, four, or even five dimensions within one graphic on one screen.

An analyst armed with access to a data warehouse, knowledge of statistics, an understanding of the data, and a data visualization tool can find data relationships that would not be readily apparent using simpler tools that deliver only two-dimensional bar, pie, or line charts. Analysts use advanced data visualization capabilities to interrogate, explore, and display data using sophisticated charts, multidimensional images, and numerous screen controls. Animation can often be used to show variations and trends over time.

Data Visualization in the Data Warehouse

Although you can use data visualization tools in a standalone environment, they are most effective when used to analyze information contained in a data warehouse. By providing improved analysis capabilities to warehouse users, the data visualization tool increases the value that is obtained from the warehouse. The data warehouse, by containing cleansed and consistent data, increases the effectiveness of data visualization by minimizing the need for the data visualization tool to perform data scrubbing.

Data mining is one area in which the use of advanced data visualization products is growing. Data visualization, by itself, can be an entire form of data mining application. Following this approach, an analyst would build numerous data displays to determine the most meaningful graphics. The number and type of data displays being generated would depend on the ingenuity and insightfulness of the analyst. In another approach, data visualization can be used as part of an initial exploratory data mining phase to identify domains of interest. Interesting graphics are used to determine the appropriate data mining statistics, modeling, or neural networks analysis to be performed on the data. The data visualization tool can then be used to present the results.

The differentiating factor in using data visualization rather than machine-based discovery data mining methods is that data visualization lets you directly incorporate human ingenuity and analytic capabilities into the data mining process. Other data mining techniques -- machine-based discovery approaches such as statistical regression, rules-based reasoning, and neural networks -- use mathematical calculations to identify interesting data and relationships. Only data visualization uses human cognition as the primary means of value discovery. Data visualization and machine-based discovery techniques form a powerful combination for data mining.

Data Visualization Technology

Two primary types of tools are used to develop advanced data visualization applications: specialized programming languages and GUI exploration and development tools. Development using data visualization programming languages, which sometimes work in concert with a GUI tool, is performed using the following steps:
  1. Extract all data or a subset of data from its source into the data visualization tool environment.
  2. Explore the data with the data visualization explorer tool.
  3. Identify key visualization needs and user interactions.
  4. Use the programming language to develop customized graphics and user dialogs.
  5. Add the developed applications to the GUI tool menu or accessible library.
Most advanced data visualization GUI tools let developers access and analyze data, select visualization graphics from a predefined set of templates, customize the graphics, and then add the graphics to a library for access by end users. Data visualization development with these tools is usually performed iteratively using the following steps:
  1. Extract all data or a subset of data from its source into the data visualization tool environment. Generally, once data is loaded, all graphics are available for browsing and exploration.
  2. Customize graphic templates that are of interest.
  3. Explore the data by looking at graphics, changing the color scheme, rotating the graphics, and/or zeroing in on interesting areas for further investigation.
  4. Add the selected graphics to the GUI tool menu or accessible library.
Data visualization programming languages are appropriate for cases in which you require specialized visualization graphics above and beyond those that are available in the marketplace. You also must have C or C++ programming abilities. GUI tools are appropriate to use when your data visualization needs do not exceed the graphics capabilities that are available in the tools. Statistical, but not programming, expertise is required.

To provide an understanding of the capabilities and technologies in the marketplace, I will discuss four top representative products in the advanced data visualization marketplace: Visible Decisions Inc.'s (Toronto) Information Animation programming environment, the CG++ customization language used with CrossGraphs from Belmont Research Inc. (Cambridge, Mass.), the SAS/Spectraview high-volume, interactive data visualization tool from SAS Institute Inc. (Cary, N.C.), and the SPSS Diamond GUI development tools from SPSS Inc. (Chicago). Details on these products and other data visualization tools are provided in the product chart on page 44.

Information Animation from Visible Decisions Inc.

Visible Decisions' Information Animation consists of the Discovery for Developers object-oriented toolkit and Anna, a dynamic interpreted object-oriented programming language. Anna provides standard object-oriented programming facilities such as encapsulation, inheritance, and polymorphism. Anna's syntax is similar to C++, although Anna provides automated memory management and other capabilities that are helpful for building graphical, database-oriented applications. Data objects are stored in the Object Pool object repository and persistent data objects are supported. An extensive class library is provided. Once an application has been created, you can compile and execute it using Discovery Runtime.

Visible Decisions' development methodology consists of the following steps:

  1. Acquire data.
  2. Model data.
  3. Attach views.
  4. Create interactive controllers.
  5. Create a landscape.
Discovery applications are connected to data sources using protocol-specific data servers. Befitting Visible Decisions' expertise in the financial industry, ODBC and flat-file static data servers are complemented by TIB (Teknekron Information Bus, from TIBCO Inc., a division of Reuters Holdings, PLC, Palo Alto, Calif.) and SSL (Reuters Source Sink Library, Reuters Holdings, PLC, London) realtime data servers. You can also create customized data servers using VDI's TCP/IP socket protocol. Discovery commands are built from Anna objects to queue and retrieve data from a server.

To undertake data modeling, you begin by placing data objects in container data structures once you receive them. Data model classes exist for simple structures such as lists or arrays as well as for complex n-dimensional arrays, which are supported by the Matrix class and classes that define the extent of each dimension. Derived data can be calculated, and external C routines can be dynamically loaded to access proprietary and preexisting algorithms. Two-dimensional and 3D user interfaces are tied to data by the use of Active Value datatypes, which allow images to be updated as soon as the associated data values change. A specialized WorkBench data structure supports high-performance mathematical operations against large data sets by optimizing calculations against arrays of similar member types.
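The Active Value idea described above can be pictured with a small observer-style sketch. This is plain Python, not Anna or the Discovery class library; the `ActiveValue` class and its `subscribe`, `get`, and `set` methods are hypothetical names chosen only for illustration.

```python
from typing import Callable, List


class ActiveValue:
    """A value that notifies registered listeners whenever it changes."""

    def __init__(self, value: float) -> None:
        self._value = value
        self._listeners: List[Callable[[float], None]] = []

    def subscribe(self, listener: Callable[[float], None]) -> None:
        # A view registers a callback that redraws itself from the new value.
        self._listeners.append(listener)

    def get(self) -> float:
        return self._value

    def set(self, value: float) -> None:
        # Notify only on a real change, so views are not redrawn needlessly.
        if value != self._value:
            self._value = value
            for listener in self._listeners:
                listener(value)


# Hypothetical "view": records the bar height it would draw for each update.
drawn_heights: List[float] = []
price = ActiveValue(10.0)
price.subscribe(lambda v: drawn_heights.append(v * 2.0))

price.set(12.5)  # value changed, the view is notified
price.set(12.5)  # unchanged value, no notification
price.set(11.0)  # value changed, the view is notified
```

The data object knows nothing about how it is drawn; it only announces changes, which is one way to keep models and views independent.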

You represent models visually by connecting View and Data Models using what is called the Sign container. Smalltalk's Model-View-Controller paradigm is used to maintain independence between Views and Models. Anna classes of cubes, lines, grids, and text are examples of basic built-in Views. These simple classes can be combined to create complex visualizations; you can also import external geometry. View position, orientation, and scaling can all be programmatically defined.

Controllers such as sliders and buttons are used to place interface components within 3D Views. When you require more detail in a View, Drill-Down Brushing lets you superimpose 2D data on 3D graphics for drill-down analysis.

Landscapes are the visual arrangement of views and 3D controllers. A landscape consists of a hierarchy of Sign containers, which are built from data model and view object classes. Different viewing metaphors are available. For example, a flexible "Helicopter" model shows a high-level graphic overview visualization with the ability to zoom to areas of interest. A "virtual salad bowl" model enables the visualization to rotate around a pivot point. Distributed Discovery is an extension to Discovery for Developers that facilitates the rapid creation of Virtual Reality Modeling Language (VRML) landscapes for distribution over intranets or the Internet.

CG++ from Belmont Research

Belmont Research offers two primary data visualization products: CrossGraphs and CG++. CrossGraphs lets you simultaneously explore data by displaying statistical graphics partitioned across selected dimensions. The result, a series of graphs displayed on one screen, is used to understand data relationships that either would not be found by looking at single, simple charts or would require an excessive effort compared to CrossGraphs. Figure 1 (page 42) shows a CrossGraphs preview window displaying bar graphs that show multidimensional retail sales by store, product category, week, and type of promotion.
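The partitioning idea behind such a display can be sketched in a few lines. This is an illustrative Python sketch, not CrossGraphs or CG++ code; the `partition` function and the sample sales records are made up for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def partition(records: List[Record], dims: Tuple[str, ...]) -> Dict[tuple, List[Record]]:
    """Group records into one panel per combination of the chosen dimensions."""
    panels: Dict[tuple, List[Record]] = defaultdict(list)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        panels[key].append(rec)
    return dict(panels)


# Made-up sample data for illustration only.
sales = [
    {"store": "A", "promotion": "coupon", "week": 1, "amount": 120.0},
    {"store": "A", "promotion": "coupon", "week": 2, "amount": 150.0},
    {"store": "A", "promotion": "none", "week": 1, "amount": 90.0},
    {"store": "B", "promotion": "coupon", "week": 1, "amount": 200.0},
]

# One panel per (store, promotion) pair; each would be drawn as its own
# small bar graph of amount by week, all on one screen.
panels = partition(sales, ("store", "promotion"))
totals = {key: sum(r["amount"] for r in recs) for key, recs in panels.items()}
```

Adding a dimension to `dims` multiplies the number of panels, which is why this style of display suits a handful of partitioning dimensions rather than dozens.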

CG++ is an advanced graphical programming development environment that can be used to create customized graphic visualizations. Windows, Motif, and Apple Macintosh platforms are supported. The CG++ object-oriented programming language is contained within the programming environment. An extensive platform-independent class library and set of programming tools are included. CG++ supports ASCII, dBASE III, SAS dataset (either native or transport format), Oracle (accessed directly), and ODBC (using an installed ODBC driver) data sources.

The key components of the CG++ toolset are GUI Builder, Source Code Editor, Browser, Project Manager, Debugger, Console, Profiler, and C++ Translator. These components work closely with the CG++ compiler, interpreter, and runtime environment.

To develop an application, you perform the following steps:

  1. Create a new project that contains all of the application GUI and source code files.
  2. Define and build the application using the GUI Builder, Source Code Editor, and Browser.
  3. Test and modify the application using the Debugger and Console.
  4. Optimize performance and generate workspaces and/or C++ for shared libraries.
Projects are easily set up using the New Project Type dialog panel. Projects can be started from scratch or built from prior project templates.

The GUI Builder is used to design window instances of the application window class. A Canvas Window and Tools Palette are used to build a window instance. The Canvas Window is the WYSIWYG model of the data entry or display window used in a runtime application. The Tools Palette contains buttons, labels, fields, and other controls that can be included in the end-user GUI window using drag-and-drop.

Each window belongs to a class. The approximately 20 standard data visualization classes include CustomGraphFmtSpec to create custom graph formats, GSColorFmtSpec to select colors, and GSScale to specify the scale factor. A window class contains methods that are directly associated with the class and all of the methods from inherited classes within the class hierarchy. Standard methods for the GSColorFmtSpec class, for example, create color specifications, set color fields, and query the colors of fields.

The Source Code Editor is used to create application code. The Construct Menu eases building of application code by providing a series of templates such as main program definition, class definition, and logical operations such as if-else. The Debugger supports step-by-step execution, breakpoints, and the ability to access or modify data storage contents. Applications, once debugged, can be compiled into workspaces or libraries.

SAS/Spectraview from SAS Institute

SAS Institute positions its data visualization software, SAS/Spectraview, as a key component in the Explore step of its Sample-Explore-Manipulate-Model-Assess (SEMMA) data mining process. Spectraview is an interactive high-volume visualization tool for viewing, exploring, and analyzing large amounts of multidimensional data. (SAS's Insight product lets you perform interactive data visualization of histograms, scatterplots, box plots, and other statistical graphics against smaller amounts of data with under 10,000 observations.) SAS/Spectraview lets you view predictive models as well as the correlation and distribution of data patterns. Volumes, isometric surfaces, and cutting planes can be enlarged and rotated to best identify interesting data relationships.

SAS/Spectraview consists of three major functions to produce advanced data visualization: data loading and filtering, image coloring, and volume visualization. All functions are performed using pull-down menu bar commands.

Data is loaded into Spectraview from a source data set consisting of at least four numeric variables -- one response variable and the X, Y, and Z independent variables. For example, one sample SAS data set includes the "mortgage payment" response variable based on loan interest rate, loan amount, and number of years in the loan. A "BY" numeric variable can also be specified so that results can be visualized across one more measure. A common BY variable is a time measure that is used to produce graphic animation over time. This would allow the mortgage payment relationships to be visualized as they have changed over the last 20 years.
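To make the shape of such a data set concrete, here is a minimal Python sketch that builds rows of three independent variables plus one response. It assumes the standard fixed-rate amortization formula with monthly compounding; the article does not say how the sample SAS data set was computed, and the function names and grid values are hypothetical.

```python
from typing import Iterable, List, Tuple


def monthly_payment(amount: float, annual_rate: float, years: int) -> float:
    """Fixed-rate monthly payment (standard amortization formula, assumed here)."""
    n = years * 12            # number of monthly payments
    r = annual_rate / 12.0    # monthly interest rate
    if r == 0:
        return amount / n
    return amount * r / (1.0 - (1.0 + r) ** -n)


def build_grid(
    rates: Iterable[float], amounts: Iterable[float], terms: Iterable[int]
) -> List[Tuple[float, float, int, float]]:
    """One row per (X, Y, Z) combination, with the payment as the response."""
    amounts, terms = list(amounts), list(terms)
    return [
        (rate, amount, years, monthly_payment(amount, rate, years))
        for rate in rates
        for amount in amounts
        for years in terms
    ]


# Illustrative grid values only.
rows = build_grid(
    rates=[0.06, 0.07, 0.08],
    amounts=[100_000.0, 150_000.0],
    terms=[15, 30],
)
```

Each row has exactly the four numeric variables the text describes; a fifth column (for example, a year) would play the role of the BY variable.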

Spectraview can access data from SAS datasets, the new SAS DMDBs (Data Mining Databases), and major relational databases. To access relational databases, SAS Data Access engines are used to create relational views that appear to the Spectraview visualization objects as SAS datasets. Support for the SAS MDDB is not yet available but planned for a future release.

You can customize the color of data, text, missing values, and other image attributes. The most interesting technique is specifying data value colors. Users map specific colors to specific data values by using a data ramp. The data ramp contains a column of color buttons displayed beside a column of data value buttons; the data values are evenly spaced and are not necessarily actual response values. Clicking on the color button next to a data value lets you use an RGB slider or color palette to select the color associated with the selected data value. Figure 2 (page 42) shows a Spectraview BarChart object with controls to visualize employee health data. At the bottom is the color selector object used to assign the colors.
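A data ramp of this kind can be approximated in a few lines of Python. The sketch below is not Spectraview's implementation: the function names are hypothetical, and coloring each value by its nearest ramp entry is an assumption, since the article does not say how values between ramp entries are colored.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def ramp_values(low: float, high: float, steps: int) -> List[float]:
    """Evenly spaced ramp values between low and high (inclusive)."""
    if steps < 2:
        raise ValueError("a ramp needs at least two steps")
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]


def color_for(value: float, values: List[float], colors: List[RGB]) -> RGB:
    """Return the color of the ramp entry nearest to the data value (assumed rule)."""
    nearest = min(range(len(values)), key=lambda i: abs(values[i] - value))
    return colors[nearest]


# Five ramp entries from 0 to 100, paired with a blue-to-red color column.
values = ramp_values(0.0, 100.0, 5)
colors = [(0, 0, 255), (0, 255, 255), (0, 255, 0), (255, 255, 0), (255, 0, 0)]
```

Note that the ramp entries are derived only from the range, so none of them has to coincide with an observed response value.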

Once data is loaded into Spectraview, a bounding box (consisting of the outlines of the 3D data visualization surface that contains all variable values) is displayed and volume visualization can proceed. All data visualization manipulation is done within the confines of the bounding box. You can view data as solid volumes where spaces between data values are filled in or as data clouds where each discrete data value is viewed. Users can select specific data values by creating a cutting plane through the volume or by probing for specific response values. You can rotate 3D images and perform animation of images by automatically moving through all of the BY variable values.
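The two selection techniques mentioned above, a cutting plane and a probe on response values, amount to simple filters over the point set. The Python sketch below illustrates that idea only; it assumes an axis-aligned plane with a tolerance band, and the function names and sample points are hypothetical rather than anything Spectraview exposes.

```python
from typing import List, Tuple

Point = Tuple[float, float, float, float]  # (x, y, z, response)


def cutting_plane(
    points: List[Point], axis: int, position: float, tolerance: float
) -> List[Point]:
    """Keep the points lying within `tolerance` of an axis-aligned plane."""
    return [p for p in points if abs(p[axis] - position) <= tolerance]


def probe(points: List[Point], low: float, high: float) -> List[Point]:
    """Keep the points whose response value falls in [low, high]."""
    return [p for p in points if low <= p[3] <= high]


# Made-up points for illustration.
points: List[Point] = [
    (0.0, 0.0, 0.0, 1.0),
    (1.0, 0.0, 0.0, 2.0),
    (1.0, 1.0, 0.0, 3.0),
    (2.0, 1.0, 1.0, 4.0),
]

slice_x1 = cutting_plane(points, axis=0, position=1.0, tolerance=0.1)
probed = probe(points, 2.5, 4.0)
```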

You can fine-tune graphs by using rendering controls. "Rendering" is a term used to describe the conversion of a set of points to presentation graphics. You can control the rendering of data points in Spectraview by both opacity degree and splatting width. Opacity degree determines which colored points are to be displayed. Splatting width determines the size of the rendered data points.

Diamond from SPSS

Diamond is the data visualization tool from SPSS, a vendor well known for its statistical modeling and analysis products. Diamond is a high-dimension (more than three) interactive data visualization tool intended both for data exploration by statisticians developing hypotheses appropriate for more detailed analysis and for the presentation of complex relationships to audiences with less statistical sophistication.

There are four phases in the use of Diamond:

Each invocation of Diamond is an "instance" that operates on one data set. Using a single screen dialog, you can build one instance to feed data to another invocation so that you can explore interesting subsets of data using additional statistical analyses.

SPSS, Systat System files, SAS, and BMDP (BioMedical Data Programs) files can be opened directly by SPSS Diamond. You can select specific data variables and data values from the native data set definitions. You can select up to 100 active variables. You can also import data from ASCII, Microsoft Excel, and Lotus 1-2-3 spreadsheets and then store it in an SPSS data set. Relational databases are not directly read by Diamond. Relational data can be read by SPSS and stored in SPSS data sets that Diamond can access, or relational data can be exported to data sets that Diamond can import.

Tailoring data consists of applying color ranges to data set variable values, scaling and transforming data, and defining new variables. You can apply colors by using red, green, and blue primary color brushes. Combinations of these primary colors are used to produce secondary colors. You can apply default colors to data values by simply clicking on a graph of the data. You apply colors to specific data values by clicking on a brush icon to choose a color and then building a box around specific data values in a graph to associate the chosen color to the data points.

To better visualize the data, over 35 transforms are available to you, including use of different scales such as logarithms or exponents, spreading out of extreme values, and adding randomized noise to data. Finally, for additional flexibility, you can define new variables using primitive operators or build sophisticated equations using functions such as sines/cosines, high/low selections, and cube roots from SPSS's Equation Function Builder library.
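Two of the transform families mentioned above, a logarithmic scale and added random noise, can be sketched as follows. This is generic Python for illustration, not Diamond's transform library; the function names, the base-10 choice, and the uniform noise distribution are assumptions.

```python
import math
import random
from typing import List


def log_scale(values: List[float]) -> List[float]:
    """Base-10 log transform; requires strictly positive values."""
    return [math.log10(v) for v in values]


def jitter(values: List[float], amount: float, seed: int = 0) -> List[float]:
    """Add uniform noise in [-amount, +amount] to separate overlapping points."""
    rng = random.Random(seed)  # seeded so the display is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


incomes = [1.0, 10.0, 1000.0]
scaled = log_scale(incomes)        # compresses the large spread

ratings = [3.0, 3.0, 3.0, 4.0]     # many identical values would overplot
spread_out = jitter(ratings, amount=0.2)
```

Jitter changes the picture, not the analysis: the noisy values are meant only for plotting, so the original column should be kept for any statistics.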

You can visualize data using various data presentation capabilities. Once a data set is opened, the following data presentation windows are available: scatterplots and histograms, pairwise (that is, scatterplots with bivariate statistics and histograms), triplewise (namely, 3D scatterplots), quadwise (two scatterplots with linked lines between corresponding points), or parametric snakes, which are scatterplots of two variables overlaid by lines in the order of a third variable that can feature animation capabilities. You can also present data in parallel coordinates (which are multiple parallel lines, each of which represents separate dimensions) and fractal foam, in which data values and relationships are represented using bubbles.
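Of these displays, parallel coordinates is the easiest to describe computationally: each variable gets its own vertical axis, and each record becomes a polyline crossing every axis at its scaled value. The Python sketch below computes those polyline vertices under the assumption of simple min-max scaling per variable; it is not Diamond's algorithm, and the names are hypothetical.

```python
from typing import List, Tuple


def normalize_columns(rows: List[List[float]]) -> List[List[float]]:
    """Min-max scale each column to [0, 1]; constant columns map to 0.5."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        low, high = min(col), max(col)
        span = high - low
        scaled_columns.append(
            [(v - low) / span if span else 0.5 for v in col]
        )
    return [list(rec) for rec in zip(*scaled_columns)]


def polylines(rows: List[List[float]]) -> List[List[Tuple[int, float]]]:
    """One polyline per record: (axis index, height on that axis) pairs."""
    return [list(enumerate(rec)) for rec in normalize_columns(rows)]


# Three records with three variables each (made-up values).
data = [
    [1.0, 10.0, 5.0],
    [2.0, 30.0, 5.0],
    [3.0, 20.0, 5.0],
]
lines = polylines(data)
```

Because every axis is scaled independently, variables with very different units can share one picture, which is what lets the display go beyond three dimensions.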

Using these graphics, many data exploration techniques are available to you. You can analyze categories of data values -- missing values, repeated values, and outlying values, for example. Scatterplots and histograms will show variable distributions; a parallel coordinate graphic is a great way to show correlation between variables. Animated triplewise 3D scatterplot diagrams let you visualize three variable dimensions, not just two. Animation, which will automatically twirl a triplewise picture, can be stopped at interesting points.
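The three value categories named above (missing, repeated, and outlying) can also be flagged numerically before they are inspected visually. The following Python sketch is one possible approach and is not taken from Diamond; in particular, treating a value as outlying when it lies more than `k` standard deviations from the mean is an assumption, as are the function name and the sample readings.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Optional


def categorize(values: List[Optional[float]], k: float = 2.0) -> Dict[str, list]:
    """Flag missing positions, repeated values, and outlying values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    center, spread = mean(present), pstdev(present)
    return {
        # Positions (indexes) where no value was recorded.
        "missing": [i for i, v in enumerate(values) if v is None],
        # Distinct values that occur more than once.
        "repeated": sorted(v for v, n in counts.items() if n > 1),
        # Values far from the mean (assumed rule: more than k std deviations).
        "outlying": [
            v for v in present if spread > 0 and abs(v - center) > k * spread
        ],
    }


readings = [10.0, 11.0, 10.0, None, 12.0, 11.0, 95.0]
report = categorize(readings)
```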

Visualize the Future

Data visualization graphics and techniques are being used to present information to users in novel ways. Advanced data visualization tools are designed to provide graphics beyond those of the simple business charts that can be created by Visual Basic, PowerBuilder, spreadsheets, and OLAP tools. 3D graphics, realtime animation, intense user interaction, and the ability to customize graphics are some of the characteristics of these tools.

Users of advanced data visualization tools generally require training in statistical analysis and should have a deep understanding of the data being analyzed. This is unlike tools with simpler visualization capabilities that are easily understood by the casual user. Strong statistical analysis skills and detailed data knowledge are required to build such graphics.

These advanced data visualization tools generally support access to their own proprietary data set formats, relational databases using ODBC, and other common statistical data set formats such as those from SAS. I do not know of any advanced data visualization tool that accesses data in an MDDB. There continue to be "islands of visualization" as OLAP and advanced data visualization tools cannot share data between applications, although it appears to me that users could obtain significant value from the integration of these capabilities.

The performance of these tools is directly related to the size of the data being processed, the complexity of processing, and the hardware platform being used. Over time, performance will certainly improve, but currently most tools are constrained by the previously mentioned factors. Data sampling, simplifying graphics, and using slower animation are techniques currently used to overcome the performance inhibitors.
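Data sampling, the first of these techniques, can be as simple as drawing a fixed-size random subset before the data reaches the visualization tool. The Python sketch below shows one way to do that with the standard library; the function name and the seed parameter are illustrative, and none of the products discussed is claimed to sample this way.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(rows: Sequence[T], k: int, seed: int = 0) -> List[T]:
    """Return at most k rows drawn uniformly at random without replacement."""
    if k >= len(rows):
        return list(rows)          # small data sets are passed through whole
    rng = random.Random(seed)      # seeded so repeated runs show the same picture
    return rng.sample(rows, k)


all_rows = list(range(100_000))    # stand-in for a large data set
subset = sample_rows(all_rows, 1_000)
```

A uniform sample preserves the overall shape of a distribution but can drop rare outliers, so it suits overview displays better than exception hunting.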

As the volume of available data grows and business graphics become commonplace, an advantage will accrue to the organizations that are able to more quickly make sense of their data -- a capability that requires human involvement and interpretation. Even when using machine-based discovery data mining techniques that can process vast quantities of data, you must analyze the results -- the answer does not automatically appear. Advanced data visualization allows for the interactive interpretation and analysis of large amounts of data, yielding insight that cannot be derived from columns of numbers and that simple charts do not display effectively.


Peter L. Brooks is a management consultant with the Advanced Technology Group of Coopers & Lybrand Consulting, based in Boston. He specializes in helping organizations achieve strategic business value by applying solutions that include business intelligence systems, data warehousing, and Internet technologies. You can email Peter at PLBrooks@compuserve.com.
See accompanying product chart for contact information on the companies and products listed in this article.


Figure 1. A CrossGraphs preview window displaying bar graphs that show multidimensional retail sales data by store, product category, week, and type of promotion. (Courtesy Belmont Research Inc.)

Figure 2. A SAS/Spectraview BarChart object with controls to visualize employee health data. The data includes observations for age, height, weight, fat, cholesterol, and blood pressure; at the bottom is the color selector object used to assign the colors. (Courtesy SAS Institute Inc.)





DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Thursday, July 10, 1997