Our ability to understand data and make business decisions based on it is being undermined by the sheer growth in both the breadth and depth of data in most organizations. New advanced data visualization techniques, often in conjunction with OLAP/ROLAP and data mining tools and processes, are being used to make sense of the enormous amounts of data that are now available for analysis.
Data visualization is the use of graphics to make sense of the reams of data that are available for analysis and decision making. In its simplest form, data visualization consists of bar, line, or pie charts. Spreadsheets and OLAP tools are examples of products with simple-to-moderate data visualization capabilities that are commonly used in business. At the other end of the spectrum are sophisticated scientific applications such as medical imaging or fluid dynamics that require complicated 3D visuals.
The development of these types of graphics requires specialized application and technical knowledge. This article describes the use of a new class of sophisticated graphic visualization and development tools that are being tailored for business applications.
As the need to understand and analyze information increases, the need to explore data advances beyond simple graphics. Advanced data visualization tools, previously used in niche market applications, are gaining the ability to support complex business data exploration and analysis needs. Simple data visualization tools require analysts to view several charts or spreadsheets sequentially to identify complex, multidimensional data relationships of interest. OLAP tools allow a more sophisticated two- and sometimes three-dimensional view of data using a spreadsheet and graphics. Advanced data visualization tools enable analysts to explore complex, multidimensional data in one screen. The promise of these tools is to allow users to explore three, four, or even five dimensions within one graphic on one screen.
An analyst armed with access to a data warehouse, knowledge of statistics, an understanding of the data, and a data visualization tool can find data relationships that would not be readily apparent using simpler tools that deliver only two-dimensional bar, pie, or line charts. Analysts use advanced data visualization capabilities to interrogate, explore, and display data using sophisticated charts, multidimensional images, and numerous screen controls. Animation can often be used to show variations and trends over time.
Data mining is one area in which the use of advanced data visualization products is growing. Data visualization, by itself, can be an entire form of data mining application. Following this approach, an analyst would build numerous data displays to determine the most meaningful graphics. The number and type of data displays being generated would depend on the ingenuity and insightfulness of the analyst. In another approach, data visualization can be used as part of an initial exploratory data mining phase to identify domains of interest. Interesting graphics are used to determine the appropriate data mining statistics, modeling, or neural networks analysis to be performed on the data. The data visualization tool can then be used to present the results.
The differentiating factor in using data visualization rather than machine-based discovery data mining methods is that data visualization lets you directly incorporate human ingenuity and analytic capabilities into the data mining process. Other data mining techniques -- machine-based discovery approaches such as statistical regression, rules-based reasoning, and neural networks -- use mathematical calculations to identify interesting data and relationships. Only data visualization uses human cognition as the primary means of value discovery. Data visualization and machine-based discovery techniques form a powerful combination for data mining.
To provide an understanding of the capabilities and technologies in the marketplace, I will discuss four top representative products in the advanced data visualization marketplace: Visible Decisions Inc.'s (Toronto) Information Animation programming environment, the CG++ customization language used with CrossGraphs from Belmont Research Inc. (Cambridge, Mass.), the SAS/Spectraview high-volume, interactive data visualization tool from SAS Institute Inc. (Cary, N.C.), and the SPSS Diamond GUI development tools from SPSS Inc. (Chicago). Details on these products and other data visualization tools are provided in the product chart on page 44.
Visible Decisions' development methodology consists of the following steps:
To undertake data modeling, you begin by placing data objects in container data structures once you receive them. Data model classes exist for simple structures such as lists or arrays as well as for complex n-dimensional arrays, which are supported by the Matrix class and classes that define the extent of each dimension. Derived data can be calculated, and external C routines can be dynamically loaded to access proprietary and preexisting algorithms. Two-dimensional and 3D user interfaces are tied to data by the use of Active Value datatypes, which allow images to be updated as soon as the associated data values change. A specialized WorkBench data structure supports high-performance mathematical operations against large data sets by optimizing calculations against arrays of similar member types.
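The Active Value idea described above, in which a display updates as soon as the underlying data value changes, can be illustrated with a small observer-style sketch. This is not Visible Decisions' API: the names ActiveValue and BarView, and the redraw counter, are hypothetical and exist only to show the mechanism.

```python
from typing import Callable, List


class ActiveValue:
    """Hypothetical stand-in: holds a value and notifies listeners on change."""

    def __init__(self, value: float) -> None:
        self._value = value
        self._listeners: List[Callable[[float], None]] = []

    def subscribe(self, listener: Callable[[float], None]) -> None:
        self._listeners.append(listener)

    @property
    def value(self) -> float:
        return self._value

    @value.setter
    def value(self, new_value: float) -> None:
        if new_value == self._value:
            return  # unchanged value: nothing to redraw
        self._value = new_value
        for listener in self._listeners:
            listener(new_value)


class BarView:
    """Hypothetical view: a bar whose height tracks one data value."""

    def __init__(self, source: ActiveValue) -> None:
        self.height = source.value
        self.redraw_count = 0
        source.subscribe(self._on_change)

    def _on_change(self, new_value: float) -> None:
        self.height = new_value
        self.redraw_count += 1


price = ActiveValue(10.0)
bar = BarView(price)
price.value = 12.5  # assigning to the data value updates the view
```

The view never polls the data; the assignment itself triggers the update, which is what keeps an animated display in step with changing values.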
You represent models visually by connecting View and Data Models using what is called the Sign container. Smalltalk's Model-View-Controller paradigm is used to maintain independence between Views and Models. Anna classes of cubes, lines, grids, and text are examples of basic built-in Views. These simple classes can be combined to create complex visualizations; you can also import external geometry. View position, orientation, and scaling can all be programmatically defined.
Controllers such as sliders and buttons are used to place interface components within 3D Views. When you require more detail in a View, the Drill-Down Brushing lets you superimpose 2D data on 3D graphics for drill-down analysis.
Landscapes are the visual arrangement of views and 3D controllers. A landscape consists of a hierarchy of Sign containers, which are built from data model and view object classes. Different viewing metaphors are available. For example, a flexible "Helicopter" model shows a high-level graphic overview visualization with the ability to zoom to areas of interest. A "virtual salad bowl" model enables the visualization to rotate around a pivot point. Distributed Discovery is an extension to Discovery for Developers that facilitates the rapid creation of Virtual Reality Modeling Language (VRML) landscapes for distribution over intranets or the Internet.
CG++ is an advanced graphical programming development environment that can be used to create customized graphic visualizations. Windows, Motif, and Apple Macintosh platforms are supported. The CG++ object-oriented programming language is contained within the programming environment. An extensive platform-independent class library and set of programming tools are included. CG++ supports ASCII, dBASE III, SAS dataset (either native or transport format), Oracle (accessed directly), and ODBC (using an installed ODBC driver) data sources.
The key components of the CG++ toolset are GUI Builder, Source Code Editor, Browser, Project Manager, Debugger, Console, Profiler, and C++ Translator. These components work closely with the CG++ compiler, interpreter, and runtime environment.
To develop an application, you perform the following steps:
The GUI Builder is used to design window instances of the application window class. A Canvas Window and Tools Palette are used to build a window instance. The Canvas Window is the WYSIWYG model of the data entry or display window used in a runtime application. The Tools Palette contains buttons, labels, fields, and other controls that can be included in the end-user GUI window using drag-and-drop.
Each window belongs to a class. Several of the approximately 20 standard data visualization classes include CustomGraphFmtSpec to create custom graph formats, GSColorFmtSpec to select colors, and GSScale to specify the scale factor. A window class contains methods that are directly associated with the class and all of the methods from inherited classes within the class hierarchy. Standard methods for the GSColorFmtSpec class, for example, create color specifications, set color fields, and query the colors of fields.
The source code editor is used to create application code. The Construct Menu eases building of application code by providing a series of templates such as main program definition, class definition, and logical operations such as if else. A debugger supports step-by-step execution, breakpoints, and the ability to access or modify data storage contents. Applications, once debugged, can be compiled into workspaces or libraries.
SAS/Spectraview consists of three major functions to produce advanced data visualization: data loading and filtering, image coloring, and volume visualization. All functions are performed using pull-down menu bar commands.
Data is loaded into Spectraview from a source data set consisting of at least four numeric variables -- one response variable and the X, Y, and Z independent variables. For example, one sample SAS data set includes the "mortgage payment" response variable based on loan interest rate, loan amount, and number of years in the loan. A "BY" numeric variable can also be specified so that results can be visualized across one more measure. A common BY variable is a time measure that is used to produce graphic animation over time. This would allow the mortgage payment relationships to be visualized as they have changed over the last 20 years.
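To make the shape of such a data set concrete, the sketch below builds a small grid of three independent variables and one response variable in plain Python. It does not reproduce the SAS sample data set: the grid values are invented for illustration, and the payment is computed with the standard fixed-rate amortization formula, which is an assumption about how the sample was derived.

```python
from itertools import product


def monthly_payment(annual_rate: float, amount: float, years: int) -> float:
    """Standard fixed-rate amortization payment (assumed, not from the SAS sample)."""
    n = years * 12           # number of monthly payments
    r = annual_rate / 12.0   # monthly interest rate
    if r == 0:
        return amount / n
    growth = (1.0 + r) ** n
    return amount * r * growth / (growth - 1.0)


# Illustrative grid values only.
rates = [0.06, 0.07, 0.08]                    # X: loan interest rate
amounts = [100_000.0, 150_000.0, 200_000.0]   # Y: loan amount
terms = [15, 20, 30]                          # Z: number of years in the loan

# One row per (X, Y, Z) combination, each carrying its response value.
rows = [
    {"rate": x, "amount": y, "years": z, "payment": monthly_payment(x, y, z)}
    for x, y, z in product(rates, amounts, terms)
]
```

Each row is one point in the 3D volume, and the response value is what gets colored. Adding a fifth column, such as a year, would play the role of the BY variable.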
Spectraview can access data from SAS datasets, the new SAS DMDBs (Data Mining Databases), and major relational databases. To access relational databases, SAS Data Access engines are used to create relational views that appear to the Spectraview visualization objects as SAS datasets. Support for the SAS MDDB is not yet available but planned for a future release.
You can customize the color of data, text, missing values, and other image attributes. The most interesting technique is specifying data value colors. Users map specific colors to specific data values by using a data ramp. The data ramp contains a column of color buttons displayed beside a column of data value buttons; the data values are evenly spaced across the range and are not necessarily actual response values. Clicking on the color button next to a data value lets you use an RGB slider or color palette to select the color associated with the selected data value. Figure 2 (page 42) shows a Spectraview BarChart object with controls to visualize employee health data. At the bottom is the color selector object used to assign the colors.
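A data ramp of this kind can be approximated in a few lines. The sketch below places evenly spaced stops across a value range and assigns a color to any response value. Blending linearly between neighboring stops is an assumption made for illustration; the article does not say how Spectraview colors values that fall between ramp entries. The function names are hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def ramp_stops(lo: float, hi: float, count: int) -> List[float]:
    """Evenly spaced data values across [lo, hi]; count must be at least 2."""
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]


def color_for(value: float, stops: List[float], colors: List[RGB]) -> RGB:
    """Color for a value, blending linearly between the two nearest stops."""
    if value <= stops[0]:
        return colors[0]
    if value >= stops[-1]:
        return colors[-1]
    for i in range(len(stops) - 1):
        if stops[i] <= value <= stops[i + 1]:
            t = (value - stops[i]) / (stops[i + 1] - stops[i])
            a, b = colors[i], colors[i + 1]
            return (
                round(a[0] + t * (b[0] - a[0])),
                round(a[1] + t * (b[1] - a[1])),
                round(a[2] + t * (b[2] - a[2])),
            )
    return colors[-1]


stops = ramp_stops(0.0, 100.0, 3)                   # three ramp entries
colors = [(0, 0, 200), (0, 200, 0), (200, 0, 0)]    # one color per entry
mid_color = color_for(25.0, stops, colors)          # halfway between first two
```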
Once data is loaded into Spectraview, a bounding box (consisting of the outlines of the 3D data visualization surface that contains all variable values) is displayed and volume visualization can proceed. All data visualization manipulation is done within the confines of the bounding box. You can view data as solid volumes where spaces between data values are filled in or as data clouds where each discrete data value is viewed. Users can select specific data values by creating a cutting plane through the volume or by probing for specific response values. You can rotate 3D images and perform animation of images by automatically moving through all of the BY variable values.
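The two selection techniques just described, a cutting plane and a probe on response values, amount to simple filters over the points in the volume. The sketch below shows one way to express them. It is limited to axis-aligned planes, and the Point type and function names are hypothetical rather than anything Spectraview exposes.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    x: float
    y: float
    z: float
    response: float


def cutting_plane(points: List[Point], axis: str, position: float,
                  tolerance: float = 1e-9) -> List[Point]:
    """Keep points lying on the axis-aligned plane axis == position."""
    return [p for p in points if abs(getattr(p, axis) - position) <= tolerance]


def probe(points: List[Point], lo: float, hi: float) -> List[Point]:
    """Keep points whose response value falls within [lo, hi]."""
    return [p for p in points if lo <= p.response <= hi]


# Illustrative 3 x 3 x 3 volume whose response is the sum of the coordinates.
volume = [
    Point(float(x), float(y), float(z), float(x + y + z))
    for x in range(3) for y in range(3) for z in range(3)
]

slice_at_x1 = cutting_plane(volume, "x", 1.0)   # one 3 x 3 slice of the volume
high_values = probe(volume, 5.0, 6.0)           # points with the largest responses
```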
You can fine-tune graphs by using rendering controls. "Rendering" is a term used to describe the conversion of a set of points to presentation graphics. You can control the rendering of data points in Spectraview by both opacity degree and splatting width. Opacity degree determines which colored points are to be displayed. Splatting width determines the size of the rendered data points.
There are four phases in the use of Diamond:
SPSS, Systat System files, SAS, and BMDP (BioMedical Data Programs) files can be opened directly by SPSS Diamond. You can select specific data variables and data values from the native data set definitions. You can select up to 100 active variables. You can also import data from ASCII, Microsoft Excel, and Lotus 1-2-3 spreadsheets and then store it in an SPSS data set. Relational databases are not directly read by Diamond. Relational data can be read by SPSS and stored in SPSS data sets that Diamond can access, or relational data can be exported to data sets that Diamond can import.
Tailoring data consists of applying color ranges to data set variable values, scaling and transforming data, and defining new variables. You can apply colors by using red, green, and blue primary color brushes. Combinations of these primary colors are used to produce secondary colors. You can apply default colors to data values by simply clicking on a graph of the data. You apply colors to specific data values by clicking on a brush icon to choose a color and then building a box around specific data values in a graph to associate the chosen color to the data points.
To better visualize the data, over 35 transforms are available to you, including use of different scales such as logarithms or exponents, spreading out of extreme values, and adding randomized noise to data. Finally, for additional flexibility, you can define new variables using primitive operators or build sophisticated equations using functions such as sines/cosines, high/low selections, and cube roots from SPSS's Equation Function Builder library.
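Two of the transforms mentioned above, a logarithmic scale and randomized noise, are easy to illustrate outside any particular product. The sketch below is plain Python, not Diamond's implementation; the function names, the base-10 logarithm, and the uniform noise distribution are all assumptions.

```python
import math
import random
from typing import List


def log_scale(values: List[float]) -> List[float]:
    """Base-10 log transform; compresses large values (inputs must be positive)."""
    return [math.log10(v) for v in values]


def jitter(values: List[float], amount: float, seed: int = 0) -> List[float]:
    """Add uniform noise in [-amount, +amount] so coincident points separate."""
    rng = random.Random(seed)  # seeded so a display can be reproduced
    return [v + rng.uniform(-amount, amount) for v in values]


incomes = [1_000.0, 10_000.0, 100_000.0]
scaled = log_scale(incomes)       # evenly spaced after the transform

ratings = [3.0, 3.0, 3.0, 4.0]    # repeated values would overplot in a scatterplot
spread = jitter(ratings, 0.25)
```

The log transform makes values that differ by orders of magnitude readable on one axis, and the jitter separates repeated values that would otherwise be drawn on top of one another.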
You can visualize data using various data presentation capabilities. Once a data set is opened, the following data presentation windows are available: scatterplots and histograms, pairwise (that is, scatterplots with bivariate statistics and histograms), triplewise (namely, 3D scatterplots), quadwise (two scatterplots with linked lines between corresponding points), or parametric snakes, which are scatterplots of two variables overlaid by lines in the order of a third variable that can feature animation capabilities. You can also present data in parallel coordinates (which are multiple parallel lines, each of which represents separate dimensions) and fractal foam, in which data values and relationships are represented using bubbles.
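Parallel coordinates are the least familiar of these displays, so a short sketch of the underlying geometry may help. Each variable gets its own vertical axis, each record's value is rescaled to a height between 0 and 1 on that axis, and the record is drawn as a line connecting its heights. The min-max rescaling and all names below are assumptions for illustration, not a description of how Diamond computes the display.

```python
from typing import Dict, List, Tuple


def parallel_coordinates(
    records: List[Dict[str, float]], variables: List[str]
) -> List[List[Tuple[int, float]]]:
    """Return one polyline per record as (axis index, height in [0, 1]) pairs."""
    ranges = {}
    for name in variables:
        column = [r[name] for r in records]
        ranges[name] = (min(column), max(column))

    lines = []
    for record in records:
        line = []
        for axis_index, name in enumerate(variables):
            lo, hi = ranges[name]
            # A constant variable has no spread; place it mid-axis.
            height = 0.5 if hi == lo else (record[name] - lo) / (hi - lo)
            line.append((axis_index, height))
        lines.append(line)
    return lines


# Invented records for illustration.
people = [
    {"age": 30.0, "income": 40.0, "tenure": 5.0},
    {"age": 50.0, "income": 80.0, "tenure": 5.0},
    {"age": 40.0, "income": 60.0, "tenure": 5.0},
]
polylines = parallel_coordinates(people, ["age", "income", "tenure"])
```

When two adjacent variables are positively correlated, as age and income are in this invented data, the line segments between their axes run roughly parallel instead of crossing.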
Using these graphics, many data exploration techniques are available to you. You can analyze categories of data values -- missing values, repeated values, and outlying values, for example. Scatterplots and histograms will show variable distributions; a parallel coordinate graphic is a great way to show correlation between variables. Animated triplewise 3D scatterplot diagrams let you visualize three variable dimensions, not just two. Animation, which will automatically twirl a triplewise picture, can be stopped at interesting points.
Users of advanced data visualization tools generally require training in statistical analysis and a deep understanding of the data being analyzed in order to build such graphics. This is unlike tools with simpler visualization capabilities, which are easily understood by the casual user.
These advanced data visualization tools generally support access to their own proprietary data set formats, relational databases using ODBC, and other common statistical data set formats such as those from SAS. I do not know of any advanced data visualization tool that accesses data in an MDDB. There continue to be "islands of visualization" as OLAP and advanced data visualization tools cannot share data between applications, although it appears to me that users could obtain significant value from the integration of these capabilities.
The performance of these tools is directly related to the size of the data being processed, the complexity of processing, and the hardware platform being used. Over time, performance will certainly increase, but currently most tools are constrained by the previously mentioned factors. Data sampling, simplifying graphics, and using slower animation are techniques currently used to overcome the performance inhibitors.
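Of the workarounds listed above, data sampling is the easiest to show in code. The sketch below uses reservoir sampling, one common way to draw a fixed-size uniform random sample in a single pass over a data set too large to hold in memory. Nothing in the products discussed here is known to use this particular method; it is offered only as an example of the general technique.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Uniform random sample of up to k rows from a stream, in one pass."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = row      # replace with probability k / (i + 1)
    return sample


# Reduce 100,000 rows to 500 before handing them to a visualization tool.
subset = reservoir_sample(range(100_000), 500)
```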
As the volume of available data grows and business graphics become commonplace, an advantage will accrue to the organizations that are able to more quickly make sense of their data -- a capability that requires human involvement and interpretation. Even when using machine-based discovery data mining techniques that can process vast quantities of data, you must analyze the results -- the answer does not automatically appear. Advanced data visualization allows for the interactive interpretation and analysis of large amounts of data that cannot be derived from columns of numbers and that is not effective when displayed in simple charts.

