graefe.query_eval_techniques
Goetz Graefe, 1993
Summary. Graefe claims that there are many fields where database
management systems could be highly beneficial yet are not in use
because of the unwieldiness of database interfaces and the ``real or
conceived'' poor performance for extremely large databases.
Traditional file systems are being used instead.
To take advantage of the many benefits of database abstraction,
Graefe points out that we need to fix these ``real or conceived''
performance problems. The paper is a survey of efficient algorithms
and query execution techniques.
The paper begins by explaining that there is a logical separation of the
physical data and the high-level user interface of the database. At the
high level, we think of the data in terms of relations, tuples & lists and
use ``logical operators'' to manipulate the data. At a lower level, we
can think of the data in terms of records, tuple ids or even physical
data blocks and use ``physical operators'' to manipulate the data. The job
of the Query Optimizer is to map the logical operators onto the physical
operators -- a mapping that is likely to be complicated, as the operators
do not have one-to-one relationships. The product of such a mapping
is the Query Plan. The job of the Query Execution Engine is to execute
this plan.
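The mapping described above can be sketched in a few lines. This is a toy
illustration, not the paper's design: the operator names, the table
LOGICAL_TO_PHYSICAL, and the function pick_physical are all my own
illustrative assumptions.

```python
# Each logical operator may map to several physical operators,
# so the optimizer must choose among alternatives per operator.
LOGICAL_TO_PHYSICAL = {
    "JOIN":      ["nested_loops_join", "merge_join", "hash_join"],
    "AGGREGATE": ["nested_loops_agg", "sort_agg", "hash_agg"],
    "SELECT":    ["file_scan", "index_scan"],
}

def pick_physical(logical_op, cost_of):
    """Return the cheapest physical operator for a logical one."""
    alternatives = LOGICAL_TO_PHYSICAL[logical_op]
    return min(alternatives, key=cost_of)

# A made-up cost table standing in for the optimizer's cost model.
costs = {"nested_loops_join": 900, "merge_join": 120, "hash_join": 80}
plan_step = pick_physical("JOIN", costs.get)
```

A real optimizer also considers operator orderings and properties of
intermediate results, so the full plan is a tree of such choices, not a
single lookup.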
In building the plan, the cost of different execution alternatives must
be considered. There are a number of different cost measures - I/Os being
very popular. In the sections that we read for 2/6, different algorithms
(and their costs) for sorting, disk access and aggregate & duplicate
removal are presented.
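To make the I/O cost measure concrete, here is a toy back-of-the-envelope
estimate for the external merge sort discussed below. The function name and
the simplified formula (2 x pages x (1 + merge passes), ignoring seeks and
final-output writes) are my own assumptions, not the paper's model.

```python
import math

def external_sort_io(pages, mem_pages, fan_in):
    """Rough I/O count (reads + writes) for an external merge sort:
    one pass to form sorted runs, then one pass per merge level."""
    runs = math.ceil(pages / mem_pages)
    passes = 0
    while runs > 1:                      # each pass merges fan_in runs at a time
        runs = math.ceil(runs / fan_in)
        passes += 1
    return 2 * pages * (1 + passes)

# e.g. 1000 pages, 10 memory pages, fan-in 10:
# 100 initial runs, 2 merge passes, so 2 * 1000 * 3 = 6000 I/Os.
```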
- Sorting. One can choose to Sort or Hash. In both alternatives, the
cost is predicted by how many I/Os will be performed. Since the paper
is concentrating on very large databases, the fact that the dataset
may be larger than memory must be taken into consideration. Rather than
being able to do an in-core sort, one must partition the dataset into
smaller sorted runs & then merge them. (I've used sort terminology here,
but hashing faces the analogous problem: partition into buckets that fit
in memory, then process each bucket.) There are trade-offs between the
``cluster'' size (the unit of data we can access efficiently at once)
and the ``fan-in'' of the merge (how many runs are merged at a time).
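The two-phase structure above can be sketched as follows. This is a
simplified in-memory model, with my own names; in a real system the runs
would be temporary files on disk and mem_limit would be a page budget.

```python
import heapq

def external_sort(records, mem_limit, fan_in):
    """Sketch of external merge sort: form sorted runs no larger than
    mem_limit, then merge them fan_in at a time until one run remains."""
    # Phase 1: produce sorted runs that each "fit in memory".
    runs = [sorted(records[i:i + mem_limit])
            for i in range(0, len(records), mem_limit)]
    # Phase 2: repeated merge passes with limited fan-in.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

A larger fan-in means fewer merge passes (fewer I/Os overall) but smaller
buffers per input run, which is exactly the cluster-size trade-off noted
above.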
- Disk access. Disk access can be improved by using ``read-ahead'' buffers
where contiguous file storage is supported. Indices are offered as an
obvious cost-saving measure. It is once again pointed out that the
LRU buffer-replacement policy is common but performs poorly in many
query-processing situations (e.g., repeated sequential scans over more
pages than fit in the buffer). On the up-side, the iterator technique
of query execution (explained in a previous section of the paper) does
very well with LRU.
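The iterator technique mentioned above can be approximated with Python
generators. This is a sketch in the spirit of the paper's demand-driven
model (each operator pulls one record at a time from its input); the
operator names and the sample table are illustrative assumptions.

```python
def scan(table):
    """Leaf iterator: produce one record at a time."""
    for record in table:
        yield record

def select(child, predicate):
    """Filter iterator: pull from child, pass matching records up."""
    for record in child:
        if predicate(record):
            yield record

def project(child, columns):
    """Projection iterator: keep only the named fields."""
    for record in child:
        yield {c: record[c] for c in columns}

# A plan is just nested iterators; records flow through one at a
# time, so the working set stays small and buffer-friendly.
table = [{"id": 1, "qty": 5}, {"id": 2, "qty": 0}, {"id": 3, "qty": 7}]
plan = project(select(scan(table), lambda r: r["qty"] > 0), ["id"])
result = list(plan)
```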
- Aggregation & Duplicate Removal. Three types of algorithms are offered:
nested loops, sorting and hashing. Nested loops, looking at each record
and then putting it in it's aggregate group (or removing it if it is
duplicate) is not usually cost effective because of the number of
temporary file writes that must be done. Sorting & hashing are o(lgn)
and both can take advantage of ``early'' aggregation or duplicate removal.
For example, when a duplicate is detected it can be removed right away.
This decreases the size of the last file to be processed to the size of
the output file.
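A minimal sketch of the hash-based variant with early combining, under
my own illustrative names (hash_aggregate, distinct); in the paper's
setting the hash table would spill partitions to disk once it outgrows
memory.

```python
def hash_aggregate(records, key, add):
    """Hash aggregation with 'early' combining: each incoming record is
    folded into its group immediately, so the table holds one entry per
    group (output size), not one entry per input record."""
    groups = {}
    for rec in records:
        k = key(rec)
        if k in groups:
            groups[k] = add(groups[k], rec)   # combine right away
        else:
            groups[k] = rec
    return list(groups.values())

def distinct(records):
    """Duplicate removal is the same algorithm: keep the first copy."""
    return hash_aggregate(records, key=lambda r: r, add=lambda a, b: a)
```

Because duplicates are collapsed as they arrive, the structure that must
be kept (or written out) is bounded by the output size rather than the
input size, which is the effect described above.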