
How indexes and other tricks can help -- or hinder -- your performance tuning efforts.
Indexes are nothing new. Recent archeological evidence suggests that they were among the tools used by Neanderthals 50,000 years ago. Well okay, I'm exaggerating a little bit, but not much. Index searching techniques were invented early on by computer programmers as a means of reducing disk I/O, and they were incorporated into several proprietary file systems prior to the advent of relational databases.
This article focuses on the indexing features and other access mechanisms available in today's SQL DBMSs. Although data manipulation commands are reasonably standard across SQL products, the indexing features are less standard because they are closer to the underlying physical architecture. Rather than attempt to list every variation in every SQL product on the market, I compare four representative DBMSs that are used in departmental through enterprise-wide applications: Oracle's Oracle7, Computer Associates' CA-OpenIngres, Gupta/Centura's SQLBase, and Microsoft's SQL Server.
The classic analogy to help you understand database indexes is the index in the back of reference books. Sure, if you wanted to find everything in the book about a particular subject you could start at the beginning and scan every page, but it is much faster to look in a smaller, alphabetized subject index that directs you to a list of pages. Then you need to scan only those pages to find information about your chosen subject. Not everything in the book is indexed, however, so if your subject is not mentioned in the index, you must still scan for it. Likewise, a database index is a look-up mechanism that helps a DBMS find the information you request faster than it could with a full scan. As with book indexes, not everything in the database is indexed, so an occasional scan may still be necessary.
The primary reason to build an index is to improve performance. But it is not the only reason to build an index. The second reason has to do with enforcing uniqueness among rows stored in a database table. Tables in a SQL database are usually designed with a primary key; that is, a set of columns with a unique value that identifies a row in the table. When a new row is inserted into a table defined with a primary key, it is up to the DBMS to ensure that the primary key value for that row is unique. Performance would be unacceptable if the DBMS had to scan the entire table each time a new row was inserted. Therefore, the accepted solution is to build a unique index on the primary-key columns and let the DBMS use that as the physical enforcement mechanism for the primary key uniqueness requirement. Some SQL products (such as Oracle7) implicitly create a unique index when you define a primary key during table creation, while others (such as SQLBase) require you to create a unique index before using a table created with a primary key.
While it's mandatory to build a unique index on a table's primary key, there are other times when an index is desirable. You can also use them on foreign keys; that is, the set of columns that allows one table to be joined to another table. Because the foreign key is always joined with the primary key of another table, it should be possible to use the unique index on the primary key of the second table to speed up the join. For performance and concurrency reasons, it is desirable to give the DBMS the option of using an index on the foreign key, even though an index already exists on the primary key in the referenced table. In practice, foreign-key values often appear as restriction criteria in a where clause, so you can use the index there as well.
You should also build indexes whenever columns are used frequently in a where clause, even though they are not part of a primary or foreign key. The more indexed columns in a where clause, the more likely it is that a DBMS will be able to use an index to speed up query performance. But there is a trade-off here: Not enough indexes results in slow queries; too many indexes results in slow changes to the database.
The best things in life may be free, but indexes have their price: increased overhead for index maintenance during the processing of insert, update, and delete statements. Not only will changes to the database be slower, but if every column in a database is indexed, query performance might suffer because index maintenance could consume the bulk of the CPU cycles. Plus, you might have locking conflicts. Thus, the challenge is to find a happy medium between too few indexes and too many indexes.
One way to meet this challenge is to look for situations in which indexes should be avoided. First, you should avoid indexes on small tables. If a few disk reads is all it takes to scan an entire table, then an index can actually slow things down because it requires an extra disk read or two. What qualifies as a small table? The size limit varies depending on the hardware, operating system, DBMS, and row size, but any table under 100 rows is a candidate. If there is any question in your mind about whether to create an index on a small table, test the performance with and without the index.
Second, you should avoid indexes on tables that will always be accessed by scans (such as transaction tables in which each row is fetched by a cursor, or statistical tables that summarize with aggregate functions). In these cases, an index would be redundant.
Third, you should avoid placing indexes on columns that have few distinct values. The best candidates for indexing are columns with unique values; whereas the worst candidates are columns with only a few possible values (the classic example is a column named "sex," which can have the values "F" or "M"). It's more efficient for a DBMS to scan the entire table than to use an index to find 50 percent of the rows. (I've even seen someone index a column with one distinct value, but that's another story.)
It's worth noting that binary large object (BLOB) columns (also known as "long" or "raw" columns) typically cannot be indexed. And for good reason: An index on a BLOB column would take up almost as much disk space as the table. Therefore, performance would suffer. Efficient access to BLOB data is an area of intense research and development effort by DBMS vendors, and many of the "object" features being added to SQL DBMSs are aimed at solving this problem. (See Maurice Frank's article, "Future Database Technologies Now," DBMS, November 1995, page 52 for a discussion of content-based queries of multimedia data.)
So when should you create indexes? This is an easy decision to make. If a table will be populated by interactive data entry, you should create indexes on the table immediately after you create the table. On the other hand, if the table will be populated by bulk loading of data from a file or from another table, you should create the indexes after the data is loaded. The rationale for delaying index creation in the latter case is that it is much more efficient to build an index all at once instead of updating the index every time a row is added. With interactive data entry, the index maintenance inefficiency is generally transparent to the user, but with bulk loading it is very noticeable.
In its simplest form, the create index command looks like the following:
CREATE INDEX sales_1996_idx
ON sales_1996 (customer, product);
In this example, the name of the index is "sales_1996_idx," the name of the table being indexed is "sales_1996," and the columns that make up the index are "customer" and "product."
However, there are many variations in different vendors' products. One popular option is to allow room for growth at the time an index is created. You can do this by leaving free space in the disk blocks used to store the index. Oracle7 has a pctfree parameter that specifies the percentage of space to leave free, whereas
Microsoft SQL Server has a fillfactor parameter that specifies how much space to fill up initially. Similarly, most DBMSs sort the data before building the index. If the data is already in sorted order, Oracle7 lets you eliminate the sort with a nosort parameter, whereas SQL Server does the same thing with a sorted_data parameter.
Many DBMSs let you create indexes on a separate disk or volume from the data tables (Oracle7 with the tablespace parameter, and SQL Server with the on segment_name parameter). Oracle7 goes even further with its storage parameter, which allows experienced DBAs to control the initial size of the physical extent allocated to the index on disk as well as the incremental size of additional extents allocated as the index grows. Physical space management within a database is a big topic and the details are beyond the scope of this article, but suffice it to say that you can realize significant performance gains by separating data and indexes onto separate disks and avoiding fragmentation (minimizing the number of physical extents used to store the index).
Here's a performance tip that's easy to use: When creating a multicolumn index, put the most selective column first (that is, the one with the most distinct values). If your company has 500 customers and five products, the create index example shown previously would be appropriate. However, if your company has five customers and 500 products, it would be better to create the index with the following command:
CREATE INDEX sales_1996_idx
ON sales_1996 (product, customer);
Because product is 100 times more selective than customer in this example, index lookups will tend to find the desired rows with fewer disk reads, thereby resulting in better performance. On top of that, you will usually be able to perform partial index lookups on product alone, with a speed close to what you would get by building a separate index on product. Furthermore, partial index lookups on customer alone aren't necessary in this case because of the low selectivity, so you don't lose anything by putting customer last. If you put customer first, you lose the ability to perform partial index lookups on product alone.
The most common physical storage structure for SQL indexes is the B-tree. (Note: This article will use the general term B-tree to represent all of the related variations including B+-tree and B*-tree). Almost every SQL DBMS on the market supports B-tree indexes. Plus, some DBMSs support additional physical index structures such as hashing and Index Sequential Access Method (ISAM). B-tree indexes are popular because of their adaptability (the tree structure balances itself dynamically as a table grows, which maintains an efficient index structure by minimizing the number of disk reads to find a given value in the index).
An index is usually thought of as a distinct data structure on disk, separate from the table it is indexing. Although this is generally true, some DBMSs allow you to combine the index and the table data to improve performance. For example, SQL Server has a clustered parameter on the create index command that stores the actual data at the leaf level of the index tree. Because the table is so tightly integrated with the index, only one clustered index per table is possible. Similarly, CA-OpenIngres has a modify table command that restructures a table in the form of an index.
Although modify table is not part of the create index command, it really does the same thing as the create clustered index command in SQL Server. The motivation for combining index and table into one data structure is faster data retrieval -- when an index search gets to the leaf level of a combined tree, it already has the desired data, so it doesn't have to perform an additional read from a separate data structure.
Oracle7 defines "cluster" differently than SQL Server. An Oracle7 cluster is a data structure that combines multiple data tables to improve performance. Rows from different tables that share key values are physically stored together on disk. Used wisely, this feature can improve join performance and reduce disk storage space requirements, but it also has the potential to reduce the performance of table scans and insert/update statements.
Oracle7 clusters can be either indexed or hashed. Indexed clusters require creation of a separate index for efficient searching, whereas hash indexes use a hashing algorithm to store and retrieve data without a separate index. However, hashing is appropriate only when you know the expected number of rows at the time you create the cluster and when you expect to perform full key searches for individual rows. Partial key searches and range searches are not supported by hashing algorithms.
Oracle7 is not the only product to support hashing. SQLBase has a create clustered hashed index command that converts individual data tables into hash structures without the need for a separate index. Once again, the term "cluster" has a different meaning. In SQLBase it means that rows with equal key values are stored together, as opposed to SQL Server, in which rows are ordered by key value, or Oracle7, in which rows from multiple tables are stored together. CA-OpenIngres also offers hashing in both the modify table and create index commands. The CA-OpenIngres modify table to hash command is equivalent to the SQLBase create clustered hashed index command in that it converts a data table to a hash structure. The create index ... with structure = hash command leaves the table unchanged and creates a separate index in a hash structure.
CA-OpenIngres, in the spirit of openness, also offers ISAM as an option. Like B-tree, ISAM supports partial key searching, range searching, and pattern matching. Unlike B-tree, ISAM is a static data structure that doesn't balance itself as it grows, resulting in overflow pages and less efficient retrieval. Thus, like hashing, ISAM is best suited to situations in which the number of rows is known ahead of time or there are time and resources available to rebuild the index manually as the table grows. The CA-OpenIngres ISAM option works with both the modify table and create index commands.
One common feature in pre-relational DBMSs was the need for the application developer to specify when to use indexes. In today's SQL products, that burden has been lifted from the developer's shoulders by the optimizer, which is a part of the database engine that evaluates each SQL statement and decides which access paths are most appropriate. The developer works at the logical level, while the optimizer translates logical requests to use the most efficient underlying physical mechanisms.
The first generation of optimizers in the mid-1980s was less than ideal and occasionally prone to making bad decisions that resulted in poor query performance. For the most part, they have now matured into reliable, although not infallible, decision makers. As you might expect, optimizers vary considerably from one SQL product to the next -- not only in syntax but in the way they work. The two basic types of optimizers in use today are rule-based and cost-based.
Rule-based optimizers evaluate the SQL statement being executed and choose which indexes to use without considering the statistical distribution of data in the tables being accessed. This approach works well with small queries that access few tables and few indexes, but it tends to have trouble with more complex queries involving more tables. Rule-based optimizers also tend to be sensitive to how a particular query is written, requiring developers to tune SQL statements for maximum performance. Earlier versions of the Oracle optimizer were rule-based, but Oracle7 offers both options and the documentation recommends that applications be migrated to the cost-based optimizer (the rule-based optimizer will be phased out in future versions).
Cost-based optimizers use statistics about the data to help determine the most efficient access path for a given query -- that is, the path that is least costly in terms of disk I/O. In general, statistics are not collected automatically by the DBMS because that would have too much impact on insert/update/delete performance. Instead, a variety of SQL commands or utility programs collect statistics as necessary. Both SQL Server and SQLBase have an update statistics command, Oracle7 has an analyze command, and CA-OpenIngres has an Optimizedb utility. Statistics can be collected on tables (for example, number of rows and number of distinct values per column) and on indexes (for example, depth of index tree from top to bottom and number of distinct index values). One rule of thumb is that statistics should be collected whenever users have recently added or modified approximately 10 percent or more of the data.
Once the statistics are collected, the optimizer can use them to help decide which indexes will result in the low-cost goal. Keep in mind, however, that there are different ways of measuring query cost: one is how long it takes to return the first row of the result set, and another is how long it takes to return the entire set. The Oracle7 optimizer accepts "hints" embedded in comments that tell it which type of cost to favor. The first_
rows hint chooses a goal of best response time:
SELECT /*+ FIRST_ROWS */ customer, sale_date, amount
FROM sales_1996
WHERE product = 'WIDGET';
Similarly, the all_rows hint chooses a goal of best throughput. There are also hints available in Oracle7 that give the developer more control over the physical access path chosen by the optimizer. The full hint causes the optimizer to choose a table scan even if an index is available:
SELECT /*+ FULL(sales_1996) */ product, sale_date, amount
FROM sales_1996
WHERE customer = 'GUPTA';
The index hint causes the optimizer to use a specific index even if others are available:
SELECT /*+ INDEX(sales_1996 sales_1996_idx) */ SUM(amount)
FROM sales_1996
WHERE product = 'GADGET';
Another hint is ordered, which causes the optimizer to join tables in the order in which they appear in the from clause. Oracle7 has a number of other hints available, but the downside of using hints is that the developer must shoulder more responsibility for understanding the distribution of the data and tuning the SQL statements accordingly. Most developers seem to favor letting the optimizer do the work unless there is a situation the optimizer can't handle efficiently. In this case, the developer must step in and use whatever tools are available to improve performance.
Finally, it's worth mentioning that in most SQL products information about indexes is stored in the system catalogs maintained by the DBMS. In Oracle7, the tables are user_indexes, user_ind_columns, index_histogram, and index_stats. In SQLBase, the tables are sysindexes and syskeys. Although the structure of these system catalogs varies considerably from product to product, they hold a wealth of information, particularly for a developer who is called in to work on a database that was designed and built by someone else.