graefe.query_eval_techniques
Goetz Graefe, 1993
Summary. Graefe claims that there are many fields where database
management systems could be highly beneficial yet are not in use
because of the unwieldiness of database interfaces and the ``real or
conceived'' poor performance for extremely large databases.
Traditional file systems are being used instead.
To take advantage of the many benefits of database abstraction,
Graefe points out that we need to fix these ``real or conceived''
performance problems. The paper is a survey of efficient algorithms
and query execution techniques.
The paper begins by explaining that there is a logical separation of the
physical data and the high-level user interface of the database. At the
high level, we think of the data in terms of relations, tuples & lists and
use ``logical operators'' to manipulate the data. At a lower level, we
can think of the data in terms of records, tuple ids or even physical
data blocks and use ``physical operators'' to manipulate the data. The job
of the Query Optimizer is to map the logical operators onto the physical
operators -- a mapping that is likely to be complicated, as the operators
do not have one-to-one relationships. The product of such a mapping
is the Query Plan. The job of the Query Execution Engine is to execute
this plan.
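The mapping described above can be sketched in a few lines. This is a toy
illustration, not the paper's design: the operator names, the table
LOGICAL_TO_PHYSICAL, and the function pick_physical are all my own
illustrative assumptions.

```python
# Each logical operator may map to several physical operators,
# so the optimizer must choose among alternatives per operator.
LOGICAL_TO_PHYSICAL = {
    "JOIN":      ["nested_loops_join", "merge_join", "hash_join"],
    "AGGREGATE": ["nested_loops_agg", "sort_agg", "hash_agg"],
    "SELECT":    ["file_scan", "index_scan"],
}

def pick_physical(logical_op, cost_of):
    """Return the cheapest physical operator for a logical one."""
    alternatives = LOGICAL_TO_PHYSICAL[logical_op]
    return min(alternatives, key=cost_of)

# A made-up cost table standing in for the optimizer's cost model.
costs = {"nested_loops_join": 900, "merge_join": 120, "hash_join": 80}
plan_step = pick_physical("JOIN", costs.get)
```

A real optimizer also considers operator orderings and properties of
intermediate results, so the full plan is a tree of such choices, not a
single lookup.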
In building the plan, the cost of different execution alternatives must
be considered. There are a number of different cost measures - I/Os being
very popular. In the sections that we read for 2/6, different algorithms
(and their costs) for sorting, disk access and aggregate & duplicate
removal are presented.
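To make the I/O cost measure concrete, here is a toy back-of-the-envelope
estimate for the external merge sort discussed below. The function name and
the simplified formula (2 x pages x (1 + merge passes), ignoring seeks and
final-output writes) are my own assumptions, not the paper's model.

```python
import math

def external_sort_io(pages, mem_pages, fan_in):
    """Rough I/O count (reads + writes) for an external merge sort:
    one pass to form sorted runs, then one pass per merge level."""
    runs = math.ceil(pages / mem_pages)
    passes = 0
    while runs > 1:                      # each pass merges fan_in runs at a time
        runs = math.ceil(runs / fan_in)
        passes += 1
    return 2 * pages * (1 + passes)

# e.g. 1000 pages, 10 memory pages, fan-in 10:
# 100 initial runs, 2 merge passes, so 2 * 1000 * 3 = 6000 I/Os.
```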
- Sorting. One can choose to Sort or Hash. In both alternatives, the
cost is predicted by how many I/Os will be performed. Since the paper
is concentrating on very large databases, the fact that the dataset
may be larger than memory must be taken into consideration. Rather than
being able to do an in-core sort, one must partition the dataset into
smaller sorted runs & then merge them. (I've used sort terminology here,
but hashing faces the analogous problem: partition into buckets that fit
in memory, then process each bucket.) There are trade-offs between the
``cluster'' size (the unit of data we can access efficiently at once)
and the ``fan-in'' of the merge (how many runs are merged at a time).
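The two-phase structure above can be sketched as follows. This is a
simplified in-memory model, with my own names; in a real system the runs
would be temporary files on disk and mem_limit would be a page budget.

```python
import heapq

def external_sort(records, mem_limit, fan_in):
    """Sketch of external merge sort: form sorted runs no larger than
    mem_limit, then merge them fan_in at a time until one run remains."""
    # Phase 1: produce sorted runs that each "fit in memory".
    runs = [sorted(records[i:i + mem_limit])
            for i in range(0, len(records), mem_limit)]
    # Phase 2: repeated merge passes with limited fan-in.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

A larger fan-in means fewer merge passes (fewer I/Os overall) but smaller
buffers per input run, which is exactly the cluster-size trade-off noted
above.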
- Disk access. Disk access can be improved by using ``read-ahead'' buffers
where contiguous file storage is supported. Indices are offered as an
obvious cost-saving measure. It is once again pointed out that the
LRU buffer-replacement policy is common but performs poorly in many
query-processing situations (e.g., repeated sequential scans over more
pages than fit in the buffer). On the up-side, the iterator technique
of query execution (explained in a previous section of the paper) does
very well with LRU.
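The iterator technique mentioned above can be approximated with Python
generators. This is a sketch in the spirit of the paper's demand-driven
model (each operator pulls one record at a time from its input); the
operator names and the sample table are illustrative assumptions.

```python
def scan(table):
    """Leaf iterator: produce one record at a time."""
    for record in table:
        yield record

def select(child, predicate):
    """Filter iterator: pull from child, pass matching records up."""
    for record in child:
        if predicate(record):
            yield record

def project(child, columns):
    """Projection iterator: keep only the named fields."""
    for record in child:
        yield {c: record[c] for c in columns}

# A plan is just nested iterators; records flow through one at a
# time, so the working set stays small and buffer-friendly.
table = [{"id": 1, "qty": 5}, {"id": 2, "qty": 0}, {"id": 3, "qty": 7}]
plan = project(select(scan(table), lambda r: r["qty"] > 0), ["id"])
result = list(plan)
```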
- Aggregation & Duplicate Removal. Three types of algorithms are offered:
nested loops, sorting and hashing. Nested loops, looking at each record
and then putting it in it's aggregate group (or removing it if it is
duplicate) is not usually cost effective because of the number of
temporary file writes that must be done. Sorting & hashing are o(lgn)
and both can take advantage of ``early'' aggregation or duplicate removal.
For example, when a duplicate is detected it can be removed right away.
This decreases the size of the last file to be processed to the size of
the output file.
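A minimal sketch of the hash-based variant with early combining, under
my own illustrative names (hash_aggregate, distinct); in the paper's
setting the hash table would spill partitions to disk once it outgrows
memory.

```python
def hash_aggregate(records, key, add):
    """Hash aggregation with 'early' combining: each incoming record is
    folded into its group immediately, so the table holds one entry per
    group (output size), not one entry per input record."""
    groups = {}
    for rec in records:
        k = key(rec)
        if k in groups:
            groups[k] = add(groups[k], rec)   # combine right away
        else:
            groups[k] = rec
    return list(groups.values())

def distinct(records):
    """Duplicate removal is the same algorithm: keep the first copy."""
    return hash_aggregate(records, key=lambda r: r, add=lambda a, b: a)
```

Because duplicates are collapsed as they arrive, the structure that must
be kept (or written out) is bounded by the output size rather than the
input size, which is the effect described above.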