DBMS
Building a DBA Repository System
By Martin Rennhackkamp
DBMS, January 1996
Applying Application-Centric Repository Principles to Physical Database Management.
In this two-part series, I explain how you can use repository technology for physical database management. In this first part, I provide an overview of repository technology and the challenges of physical database management. I also discuss the shortcomings of existing repository-based tools with respect to physical database management, and describe how repository technology is applicable to physical database management. Next month, I will describe a repository-based system that my company has developed for physical database management.
Repository Technology
A repository system's purpose is to record and manage the information about the data models and the designs and implementations of the enterprise databases, applications, and systems. In other words, the repository system records and manages the metadata of the enterprise. The repository is a knowledge base that contains metadata about the information an enterprise keeps and accesses, the processes it performs to be successful, and how it has implemented its application systems to support its information processing needs. A repository system is illustrated in Figure 1.
Thus, a repository system is a database-centric application that manages corporate metadata and therefore its data and application systems. The result is an integrated and manageable corporate information resource. Repository technology is a key component in the integration of the enterprise's information environments. It is the technology used to integrate different data models, independent databases, diverse and heterogeneous database technologies, user applications, CASE tools, reference and directory management systems, systems administration, and object management services. The repository system presents a single, consistent view of the metadata describing all of these pieces.
Content
The metadata stored in the repository system typically includes the following:
- Data dictionary metadata includes relations (such as tables), keys, attribute details, formats, integrity constraints, and business rules applied to database structures.
- Design metadata includes requirements documents, design documents, architectural diagrams, data flow diagrams, structure charts, hierarchy charts, and project plans. This includes logical and physical data models, business rules (for example, obtained from forward- or reverse-engineering), and integrity rules.
- Application development metadata includes version-control data, source code, test scripts, and quality assurance reports (including physical data structures for DBMSs and 4GLs), as well as query documentation for end users.
The repository system must not only track metadata such as physical data elements, screens, forms, and tables, but it must also track business rules, in which some data elements might be related to other data elements. For example, it must track which data elements appear on which forms, screens, and tables. Therefore, the information about the relationships between the data elements and the screens should also be maintained. The metamodel must contain all of the valid associations, in addition to the objects and properties that make up the metamodel.
Facilities
The repository system's facilities are implemented as database applications operating on normalized sets of databases, object bases, knowledge bases, and files to manage the metadata at all levels of abstraction. These facilities, which provide standardized services for data access, data sharing, and data management, include:
- Modeling tools (that is, CASE tools) for recording and analyzing enterprise knowledge in the form of data and process models.
- Library facilities, such as dictionaries, encyclopedias, thesauruses, and glossaries.
- Directory management facilities to record data addresses and attributes, in order to provide naming services.
- Systems administration facilities to install and manage information databases.
- Repository management facilities, such as object management services, time services, and interfaces to other services.
- Interactive interfaces with DBMSs, 4GLs, 3GLs, precompilers, and other software components, such as scheduling packages, all of which provide the ability to cross-reference and synchronize jobs, programs, and data definitions.
Functions
A repository system must support the storage, retrieval, version management, and configuration control of all the metadata used for the many aspects of application development. It must provide the user applications with a consistent scheme for managing data in a distributed information processing environment. The application processes must be able to share data across heterogeneous computer networks, yet remain insulated from the details of interaction with multiple types of hardware and software. The repository system must provide a uniform services interface to application processes, and it must maintain knowledge of the physical location of the data, as well as control the distribution thereof.
A repository system must provide the following functions:
- Support and encourage repository sharing resources across multiple business units throughout the enterprise.
- Provide concurrent, multiuser access.
- Provide a robust reporting environment to users and technical specialists.
- Store multiple process and data models of business units with multiple versions of each.
- Support a metamodel that is adaptable to change and enforces data integrity constraints.
- Provide a solution consistent with existing frameworks and standards, such as IBM's Repository Manager framework, John Zachman's Information Systems Architecture (ISA) framework, and various ANSI standards (for example, the ANSI/X3/SPARC three-schema architecture, some of the ANSI SQL recommendations that address system catalog interfaces, and the ANSI standards for an Information Systems Resource Dictionary [IRDS]).
- Maintain descriptions of development objects for analysis, design, code generation, help screens, and recovery.
- Maintain change-management histories.
- Maintain and generate 4GL cross-references and program documentation.
- Maintain component engineering information.
- Provide the ability to add object types and define their interaction (objects such as programs, systems, jobs, files, fields, and reports integrate with each other).
- Provide the ability to add indexing words to objects for searching.
- Know which repository facilities are running and accessible on the network.
- Know what data resides on the network, how to retrieve it, and whether the specified data is currently retrievable.
- Enforce security constraints on network-based repository resources on the network.
- Provide Code generation.
- Provide Impact analysis.
Some code, such as data definition statements, can be automatically generated from the information stored in the repository -- as long as the repository and the DBMS are tightly integrated. The input for physical design is in the repository (definitions, names, lengths, and formats). With these elements, you can create constructs for testing and later move them to production. Because database definitions are usually DBMS-specific, the DBA will have much less work to do if the repository can address the DBMS in its own terms (for example if the repository can generate the data definitions in a syntax supported by the DBMS, or, even better, if it can run the database definitions directly against the databases managed by the DBMS).
The repository's role is to identify the impact made when programs alter the way they use data or when the data structures change. An analyst must be able to determine the impact of these changes, even if the only resource is a list of potentially affected relationships. Cross-referencing is good for more than just impact analysis. Using even a partial cross-reference with the actual program descriptions results in "asset management"; that is, the ability to increase the reusability of subroutines or other program code. Impact analysis should also examine the nature of the following business rules affected by the change:
- Entity instance relationships, typically enforced through the DBMS's referential integrity feature.
- Data integrity rules, such as domains, attributes, dependencies, and derivation rules.
- State transition rules, which determine the allowable changes for each entity instance.
- Event processing rules, which control how processes are invoked to satisfy the requirements of a specific business event based on the associated entity instance's state.
- Delivery rules, which govern the external interfaces such as screens, dialogs, and reports.
In multi-release impact analysis, you must perform the same impact analysis for each application version in order to determine which application components have been affected by a change. Another change-management challenge occurs when the release schedule is restructured. When this occurs, you must conduct careful impact analysis.
Physical Database Management
Database management, in the generic sense, covers the entire life cycle of all of an organization's databases, from requirements definition through logical database design to physical database design, implementation, and performance tuning on various target databases. Logical database design is performed by a data administrator. This administrator documents the logical data model, typically using a CASE tool. The goal is to represent the users' business needs as a consistent and correct logical data model. The data administrator conducts JAD sessions with users to establish a global logical data model, which is sometimes partitioned for particular application areas.
A DBA performs physical database management. Physical database management carries the logical data model forward to physical design. The goal of physical database management is to implement and manage the logical data model as an efficient, consistent, and correct physical data model.
A large organization typically has a number of databases, often in various stages of development (for example, regional production databases in its regional offices, a central production database in its head office, an evolving warehouse database for statistical processing, and various test and development databases). Developers must manage the implementation of the physical data model on the various databases of the enterprise.
Obviously, physical database management is complex. DBAs must manage multiple physical implementations of the same or different logical data models in different databases, and concentrate on tuning various aspects of these different databases to improve performance. In addition to the widely diverging performance tuning aspects, DBAs must also manage various other implementation-related objects, such as triggers, user-defined procedures, rules, events, and views.
As an example, see Tables 1 and 2. In the sales branch's database, the Employee table must be indexed on the EmployeeName column, preferably using a b-tree or ISAM structure that allows partial key searches. In the head-office database, the Employee table must be indexed on the unique EmployeeID key column and on the repeating Department foreign-key column, both preferably using some hash structure that allows direct key look-ups. In the warehouse database, however, the view (see Table 3) of these two tables would be used.
In addition, on the branch databases, the DBA must implement the business rule that employees in the sales department may not take more than 30 days leave, while employees in other branches may not take more than 25 days leave. On the head-office database, the DBA must also implement the business rule that there may only be a total of 25 employees in all the sales departments country-wide, and only eight employees in the R&D department country-wide. Meanwhile, in the IT section, the programmers are changing the human resources system to work with the Employee table (see Table 4), which would cater to new taxation laws.
Available Tools
The database management function must manage complex, interrelated objects such as table definitions, integrity constraint definitions, business rules, views, stored procedures, and performance improvement facilities. In an environment with multiple DBMSs, this is even more complex because each vendor's implementation of these database objects differs. For example, CA-OpenIngres implements business rules through stored rules and procedures, while Oracle and Sybase implement business rules through triggers. However, Oracle's and Sybase's SQL syntax triggers differ significantly.
Software developers work in a slightly different environment than that of a DBA. They work with different versions of program code, written in different programming languages (such as C, C++, Cobol, Visual Basic, and various 4GLs), addressing different target systems (such as Windows for DOS and Motif for Unix), packaged in different releases, and targeted for deployment on different development, test, and production systems, in ever-changing release cycles. Software developers can, however, use repository-based tools to develop and manage the complex environment of interrelated software components in various stages of development and deployment.
Most repository-based tools, including CASE tools, address only the subject area of the data administrator. Most CASE tools have the facilities to generate simple ANSI SQL database definition scripts, and most even have the facilities to generate SQL database definition scripts for specific target systems, such as CA-OpenIngres, Oracle, and Sybase. However, few of these CASE tools address the full complex environment managed by the DBA. Very few CASE tools provide for different logical data models on different databases, and even fewer provide for different physical models or implementations on different databases.
One of the CASE tools that has made the most progress in this area is Sybase's S-Designor. It lets you map a logical data model (defined in terms of entities, attributes, and relationships) to various physical data models (defined in terms of tables, columns, and constraints), as supported by different DBMSs (for example SQL table definitions in the SQL dialects supported by Sybase, Oracle, and CA-OpenIngres). For a discussion and a listing of CASE tools, see "The Evolution of Client/Server CASE," on page 67.
Most CASE tools will allow multiple versions of the Employee table. This approach is very useful for the evolution of the Employee table used in the branch and head-office databases. Few CASE tools, however, let you associate versions 1 and 3 with the branch and head-office databases, while version 2 is deployed on the warehouse database.
Very few CASE tools provide for physical tuning aspects, such as table structures, fill factors, and re-indexing scripts. For example, few CASE tools make the distinction that: tables can be structured as b-tree, hash, heap, or ISAM structures in CA-OpenIngres, and as b-tree or hash structures in Oracle; you can specify a table preallocation size in CA-OpenIngres; and you must specify a table preallocation size in Oracle. Very few CASE tools address the extended features provided by the more advanced RDBMSs such as CA-OpenIngres, Oracle, and Sybase. These include objects such as procedures, rules, triggers, events, and views. For example, few CASE tools let you construct a view as a select query about existing tables and other views, with full cross-referencing information on which columns are used in which views. Finally, few CASE tools let you capture stored procedure and trigger definitions, let alone address the subtle differences between implementation and syntax (for example, between CA-OpenIngres, Oracle, and Sybase).
A few repository-based CASE tools do address the interface between logical database design and physical database implementation through simple SQL script-generation facilities. However, the database management function must still use the tools and facilities of the operating systems and the DBMSs to manage physical database management in the complex database environment.
Applying Repository Technology
Despite the challenges, it makes perfect sense to utilize repository technology for physical database management. If you consider the contents of the repository system, the following applies to physical database management:
- The database objects described by the data dictionary metadata -- in other words, the relations (that is, tables), keys, attribute details, formats, integrity constraints, and business rules -- are the same objects managed by the DBA. The focus just changes from the objects accessed by the application programs to the primary objects being managed. This includes adding all the implementation and deployment details of all these various objects.
- The design metadata still includes requirements documents, design documents (such as entity-relationship diagrams, relation diagrams, entity-life histories, data fragmentation, and allocation schemas), data flow diagrams (these still represent the program entities accessing the database objects), structure charts, hierarchy charts, and project plans. For the DBA, the focus shifts once again to the physical implementation details (that is, the differences between the operational systems running on CA-OpenIngres and the warehouse system running on Oracle), the physical deployment details (which version of which table is deployed on which database, and which database is implemented using which DBMS), and physical design details (that is, the different indexes used in the branch and head-office databases).
- The application development metadata still includes versions of applications and queries accessing the various database objects, including references to control data, source code, test scripts, and quality assurance reports. The metadata also includes physical database structures implemented in the different DBMSs to cater to the application logic implemented in the various queries, 3GL, and 4GL programs.
When you consider the functions provided by a repository system, the following applies to physical database management:
- The repository resources, including all of the database object definitions, are shared across multiple business units throughout the corporation. All of the DBAs in the various projects and departments must share one global logical data model, with its various physical implementation details.
- DBAs, data administrators, and application developers must have concurrent, multiuser access to all the metadata used for physical database management.
- The DBAs, data administrators, and application developers must be able to report on the physical database management metadata from any perspective.
- The repository system must store multiple physical data models and the process specifications used by all business units, maintaining multiple versions of each as they evolve and as they are applied to different physical databases.
- The meta-model used for physical database management, similar to its application-centric counterpart, must be adaptable to change. It must also enforce data integrity constraints to ensure that the metadata used for physical database management is correct and consistent.
- The metadata used for physical database management must also support existing frameworks and standards, such as John Zachman's ISA framework and the ANSI SQL standard.
- Descriptions of all the database objects must be maintained for analysis, design, code generation, help screens, and recovery.
- Change management histories of all the database objects must be maintained to illustrate how they have evolved over time (to provide for roll back to previous releases, for example).
- You must be able to generate 4GL and program cross-references and documentation (such as which tables and views are used in which queries, which stored procedures are called by which applications, and so on).
- The component engineering information used for physical database management is extensive. Tables and views are used in view definitions and procedure definitions, all of the database objects are grouped into schemas, schemas are deployed on logical databases (such as a branch database), and logical databases are implemented as physical databases (such as the branch database in the Miami sales office).
- The ability to add database object types and define their interaction is also important. DBMSs are constantly evolving and constantly supporting more and more diverse types of objects stored in the database. Examples include declarative referential constraints, stored procedures, triggers, and database events. Current extensions provide for objects, persistent objects, query execution plans, optimizer statistics, and distributed database object definitions.
- Similar to its application-centric counterpart, it should be possible to add indexing words to the managed objects and search on the basis of those indexes.
- The repository system must know which database management facilities are running on the network and whether they are currently accessible. The system must also know what data resides on the network, how to retrieve it, and whether the specified data is currently retrievable.
- The repository system must enforce security constraints on the network-based repository resources.
- The repository system must be able to generate code for all the facets of physical database management (such as create, modify, optimize, restructure) of all the database objects (such as domains, tables, views, triggers, events, schemas, and even entire databases) for multiple target systems (such as Oracle, CA-OpenIngres, Sybase, and so on), each with its own subtle syntax differences.
- Impact analysis is very important for physical database management. The DBA must be able to determine the impact of any change to the physical database. For example, if a table's column changes, which indexes, views, procedures, triggers, and events are affected by the change? In which databases is the specific version of the table used?
The most important analogies between the functions provided by conventional repository systems and a repository system adapted or extended for physical database management is shown in Table 5.
The proliferation of repositories is a big problem. The "theoretical" correct approach is to use one standard, integrated repository for all these functions -- which results in totally integrated metadata management. In practice, however, this does not happen (yet), because all these products have their own, fairly proprietary, repositories.
In the DBA Repository System (which I explain in detail next month), we took the viewpoint that most CASE tools can generate SQL-92 syntax (at least some adherence to standards!), with some variations for each particular DBMS. We therefore provide an import facility with which you can "synchronize" the DBA Repository from your CASE tool repository. It doesn't blindly bulk-load the metadata into the repository database; rather, it matches and resolves previous versions of the defined objects where it can. This is how we overcome the workload issue -- no one, not even different developers on the same project, wants to capture the metadata twice, because that would leave open the possibility for discrepancies between the logical and physical data models.
Until Next Month. . .
DBAs and application developers work in equally complex environments. They perform similarly complex tasks in order to manage multiple interrelated database objects for different target systems. DBAs inspect and manage the same managed objects, but from a different perspective -- they actually manage the database objects traditionally cross-referenced from applications, and they cross-reference the application objects themselves. However, they do not have highly integrated repository systems at their disposal.
Although some of the repository-based development systems cater to different logical and physical models, and although some of them can help the DBA deploy the physical model on development, testing, and production databases, they do not provide sufficient functionality for the DBA's day-to-day tasks. For example, they do not provide functions for re-indexing tables, re-gathering optimizer statistics, and re-allocating tables to different physical devices. On the other hand, the tools traditionally used by the DBA to perform these tasks, be they from a third party or part of the DBMS itself, are usually not repository based. They only contain the present state of the database objects, without facilities to document or manage previous or future versions of the database objects.
I have tried to show in this article that you can apply the functions performed by a conventional application-centric repository system equally well to physical database management. It's just the focus that changes. The repository objects that DBAs would use most are the actual database objects (that is, the tables and views to be accessed), with the relationships, integrity, and other constraints defined among them, as well as the where-used information (that is, which tables and views are used in which queries and applications).
Next month, I will describe a repository-based system that we at The Data Base Approach have developed for physical database management.
FIGURE 1

--A repository system. The repository is a knowledge base that contains metadata about the information an enterprise keeps and accesses, the processes it performs to be successful, and how it has implemented its application systems to support its information processing needs.
TABLE 1. The Employee Table
| EMPLOYEE
|
| EmployeeID | EmployeeName | Department
|
| 1345 | John Walker | Sales
|
| 4692 | Sue Brown | Repairs
|
| 1467 | Pam Johnson | R&D
|
TABLE 2. The EmployeeLeave Table
| EMPLOYEELEAVE
|
| EmployeeID | FromDate | ToDate
|
| 1345 | 1 Jan 1995 | 15 Jan 1995
|
| 1345 | 5 Apr 1995 | 10 Apr 1995
|
| 4692 | 1 Jan 1995 | 8 Jan 1995
|
TABLE 3. The Employee view of Tables 1 and 2
| EMPLOYEE
|
| EmployeeID | EmployeeName | Year | DaysLeave
|
| 1345 | John Walker | 1995 | 22
|
| 4692 | Sue Brown | 1995 | 6
|
| 1467 | Pam Johnson | 1995 | 0
|
TABLE 4. The Employee table with taxation laws included
| EMPLOYEE
|
|---|
| EmployeeID | EmployeeName | SocialSecurityNumber | Department | AppointmentDate
|
| 1345 | John Walker | 34-89675-234 | Sales | 4 Jun 1980
|
| 4692 | Sue Brown | 67-11201-829 | Repairs | 1 Jan 1994
|
| 1467 | Pam Johnson | 45-24011-114 | R&D | 1 Apr 1992
|
TABLE 5.
| ASPECT | APPLICATION-CENTRIC REPOSITORY SYSTEMS | PHYSICAL DATABASE MANAGEMENT REPOSITORY SYSTEMS
|
| Share repository metadata across business units | Application developers share global definitions and application libraries | DBAs share physical database objects and definitions
|
| Multiple versions of models | Versions of application design and implementation documents | Versions of logical and physical database design documents
|
| Multiple versions of object definitions | Versions of program code, shared libraries, test scripts, and so on | Versions of table and view definitions, trigger code, database definitions, and so on
|
| Adaptable/extendable meta-model | Application development approaches change | Object types stored in databases change
|
| Existing frameworks and standards | John Zachman's Information Systems Architecture (ISA) framework; AD/Cycle framework | John Zachman's Information Systems Architecture (ISA) framework; ANSI SQL standard
|
| Descriptions of objects for analysis, design and implementation | Application models, application designs, program code libraries | Data models, database designs, database definition scripts
|
| Change-management histories | Versions of application designs; versions of program code | Versions of database designs; versions of database definition scripts
|
| Cross-referencing documentation | Interrelated application objects (such as procedures calling other procedures); database objects used by applications | Interrelated database objects (such as tables used inviews and stored procedures); application components accessing database objects
|
| Component engineering information | Code in functions in procedures in application systems | Columns in tables in views in stored procedures in schemas in logical databases in physical databases
|
| Code generation | Code libraries, programs, application systems | Database management scripts, database definition scripts
|
| Change/impact analysis | Application components affected by changes in other application components | Application components and database objects affected by changes to other database objects
|
Comparison of conventional application-centric repository systems and physical database management repository items.
Martin Rennhackkamp is the owner and principal consultant of The Data Base Approach, a corporation specializing in relational and distributed databases, based in Cape Town, South Africa. You can reach Martin via the Internet at mr@dba.co.za.
Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
January 1996 Table of Contents | Other Contents | Article Index | Search | Site Index | Home
DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Sunday, December 1, 1996