Prince George's County Public Schools' Data Warehouse Provides Effective and Accurate Data for One of the Nation's Largest Public School Districts.
One of the foundations of American society is free public education for all children. In this country, education is mandatory until age 16, and public law requires the education of all children, including children who are not U.S. citizens. Yet many Americans agree that our public school systems are not doing a good job of educating children, and the financial cost of this perceived mediocre performance is considerable. The cost of public education in Maryland in taxpayer dollars falls between $5000 and $8000 per student per year. For students with special needs, this cost can exceed $100,000 per year. At a yearly average cost of $6500 per student, the total cost for a kindergarten to high school graduation education is almost $85,000. Over half of all local tax revenues is consumed by the education system. In Maryland, the local contribution -- about 55 percent -- exceeds $2.5 billion per year.
The cost of educating our children is high, but the cost of not educating them would be higher. The earning potential created by a good education is well documented. Studies have also linked mental and physical health to a good education. Unfortunately, statistics continue to show the United States falling behind the other industrialized nations in terms of educational performance. Although we do not lack for opinions on how to solve our educational challenges, some of the opinions are frequently controversial and conflict with other ideas. A few educational leaders thrive in this situation; they continue to make intelligent decisions despite a dearth of information to support or guide their decisions. Clearly, educational leaders would benefit from an information system to help them make or support decisions based on fact and research. However, the necessary information is limited at best. Contemporary information systems and information technology are not common in public schools.
There are at least four reasons for this: leadership, staffing, funding, and politics.
This article describes the efforts of the Prince George's County Public Schools to provide information for effective, timely, and accurate decision support for one of the 20 largest public school districts in the country, using the technology of data warehousing. One impetus for this effort was a study that indicated that success in college could correlate to the successful completion of both algebra and geometry in high school. This study included the finding that although minorities and women do not frequently enroll in math, science, engineering, or related majors in college, these enrollment trends were all but eliminated if the students were successful in algebra and geometry in high school. The College Board developed a special program called Equity 2000 to implement the concepts supported by this research. The College Board wanted to test this research and ensure that the schools involved were actually monitoring and supporting the enrollment and performance of algebra and geometry for all students. Initially, five school districts from around the country were selected to participate in a pilot program based on this research. Our school district was one of those selected. This school district serves approximately 120,000 students (including a large minority population), and it has an annual budget of more than $700 million. The College Board wanted to ask several simple questions during this pilot project:
Many of the computer systems and software were developed or modified in-house over a period of years. Although some systems were well defined, the documentation of computer systems was frequently inadequate. To overcome the lack of documentation, we frequently had to consult with the local organizational expert or examine the program code.
Many of the challenges we faced have already been well described by other case studies in DBMS. For example, we also had to contend with capturing, scrubbing, describing, and storing data from a variety of legacy data systems. Likewise, we found some inconsistencies among common data elements on different legacy systems. One example was the variety of ethnic and gender categories found in the different legacy systems. Males and females were described as 1 or 2, M or F, B or G, or with full words such as male or girl. Many school districts have standardized the five federal race categories of Native American, Asian, African American, White, and Hispanic; we found these categories represented in school data systems as full labels, partial labels, or numerical codes. More troublesome than the manageable but inconsistent coding were the legacy systems that reported conflicting data; one system might report a student as a Hispanic male, and another might report the same student as an African American female. Drilling down to the individual student is certainly a reasonable goal of a student-centered data warehouse. Finding that a Hispanic male is labeled an African American female will adversely affect you and your data's credibility. Our solution was to identify the legacy data source that had responsibility for each unique data element. Any questions about the correct value for that data element were referred to the data source responsible for that data element. One result of this procedure was the gradual reduction of data anomalies from all legacy data sources. I refer to this as the "you don't neglect what you inspect" principle.
In some ways, school system data needs are different from most industry needs. A school district data warehouse need not be concerned with rapid update cycles. Most school data, such as student enrollment or grading, can be loaded or updated every two to three months. Some data, such as standardized testing, must only be loaded or updated once a year. Although this schedule makes the job less hectic, a school district data warehouse must also plan for a very long data retention requirement. Public law obligates the school systems to educate, and maintain records for, any child up to age 21. Research requirements may require an even longer maintenance of records. The longer the data must be maintained, the greater the problems of data drift. Data drift is the gradual change in the meaning or value of a data element. Depending on the number of data elements needed and the long data retention time required, data drift can be a serious concern. The complex relationships and diverse volume of data required for modern school evaluations also require careful design considerations. In school data analysis, you can never be too rich, too thin, or have enough data. Figure 1 depicts the normal data integration frequently found in public school districts.
Specific data is gathered from these separate legacy data sources as required. All data accepted into the data warehouse is processed using the steps shown in Figure 2. These steps include data collection, data scrubbing/confirmation, data enhancement, SQL database, summarized database, and data analysis.
Because this data will be used not only for high-level decision making but also for accountability, the data scrubbing and data confirmation procedures are enhanced beyond simply standardizing the data elements. Data is checked to ensure that it is loaded and that all data elements that are not free-form comply with expected values. Additionally, data anomalies found up the pipeline in the analysis stage are returned to the data scrubbing stage and to the responsible legacy system for correction or explanation. The data enhancement step includes adding additional data elements or variables to the original data set. These data elements are added to help augment meaning for the records and also to add in the use of data for evaluation. A good example of value-added data is course equation codes. Our school district has over 2000 legitimate course numbers for the four years of high school. Asking a simple question -- such as "how are the high school students doing in math?" --results in an almost indecipherable amount of data and little relevant or useful information. Course equation codes (an example of value-added data or data enhancement) provide the necessary structure to permit a reasonable and useful answer to the original question. One of the constraints of our data warehousing task was not to interfere with the normal operation of the existing legacy systems. Developing course equation codes was one method for us to add needed structure to answer the question: "In which courses are the students enrolled, and how are they doing in these courses?" Developing course equation codes required the expertise of the four core content area supervisors for math, science, English, and history. An unexpected benefit of this project was increased awareness of the content area supervisors over the courses they supervised.
Once the data has been scrubbed, confirmed, and enhanced, it is ready to be added to the database as SQL expert data. SQL expert data is intended only for researchers who are technically advanced, comfortable with SQL, and knowledgeable about the nature of the data. Lightly and highly summarized data has been formatted for ease of use and to eliminate some of the constraints of SQL and the complexity of the data. One of the most important concepts of lightly summarized data is the elimination of one-to-many or many-to-many relationships and the enhancement of the metadata and design around a commonly agreed principle. Basically, lightly summarized data includes one, and only one, record per student concerning a single data area such as core course matriculation and performance.
Figure 3 shows a high-level Venn diagram of the five levels of data that compose our school district data warehouse. The first level contains all available data. Access to this data without direct expert assistance is not generally permitted. Some of this data will be offline and stored on tape. The next level is SQL Expert Data: data that has gone through the quality checks and enhancing stages and has been entered into the database. SQL Expert Data represents a specific database design. Access to this data requires expert skill in relational database management and SQL skill, as well as expert knowledge in the nature of the data. The third level contains lightly and highly summarized data, which has been reformatted and often denormalized to allow access to several different types of data without expert knowledge in SQL or RDBMS. This third level is the back end to front-end data analysis tools. Although it is conceptually the smallest of the main data pools, more than 90 percent of all information extraction comes from this level. The fourth level is Value Added Data Elements, which constitutes data that was not included in the source data but has been added to aid in the understanding and use of the data. This Value Added Data is found in the second and third levels to assist in analysis. The last level is Metadata, which is usually defined as data about data. Because school data is so complex, this level is much more detailed than the metadata used in other areas of the computer information industry.
Cost and available skill were (and remain) major driving forces in deciding which hardware and software tools to use. Our system is running on standard IBM-compatible computers, primarily from Gateway 2000 and Dell. The database server is a Dell SP5 Pentium 90 with 32MB of RAM and 15GB of hard drive capacity. The server is running Windows NT 3.5.1 as its operating system. Once configured, NT has been very reliable; I suspect that some of this reliability results from the lack of human interaction. The client machines vary from 486/66MHz to Pentium 133MHz PCs, all with 16MB of RAM and at least 1GB of hard drive capacity. The client PCs are running Windows 95, and we have one office file server, a 486/33MHz with 8MB of RAM, also running Windows 95. When I first started this project, we ordered all PCs with SCSI hard drives. In 1993 (it doesn't seem that long ago), if you wanted more than two hard drives and you needed over 1GB on the hard drive, SCSI was one of the few choices. In the past three years and with constant use, none of the SCSI drives have failed. In the past year, IDE and EIDE hard drives with a capacity of over 1GB have been available and inexpensive. However, we have experienced several failures with IDE and EIDE drives. Our experience has been that a name-brand computer such as Dell or Gateway 2000 using SCSI devices outlives its useful life before any equipment fails. We have an Ethernet LAN, using the built-in Microsoft networking in Windows NT and Windows 95, and all client and server PCs are attached to a UPS. This setup ensures that the LAN can stay up during our many power fluctuations. The database is R:Base. We have found that we need to run both the latest Windows version of R:Base (currently 5.5a) and one of the earlier versions of DOS (4.5++). The more primitive the legacy system data extraction, the more useful we find the earlier DOS version of the database tools and features. R:Base was selected because of its SQL power and extended SQL tools. The built-in front-end tools, easy application generator, data input and export, and database server features were additional benefits of R:Base. The front end to the database is a combination of SPSS Inc.'s SPSS (a statistical analysis package), Microrim Inc.'s R:Base, and Seagate Software Inc.'s Crystal Reports. The three front-end tools provide three levels of information extraction. SPSS is used for complex analysis that sometimes requires the additional statistical tools SPSS provides. Version 7 of SPSS has greatly enhanced the ease of use for this powerful tool. R:Base comes with a very easy-to-use application generator and report writer. Crystal Reports has the easiest interface while still providing plenty of information extraction power.
This school district data warehouse provides educational decision makers with reliable, timely, and understandable information from the wealth of often disconnected legacy systems data. This system offers a quick, easy mechanism to understand the complex data that makes up a student's school experience -- without requiring specific SQL skills. This system also ensures that this data is accurate enough to be used for research and accountability.


