parallel


Parallel Database Systems: The Future of High Performance Database Systems

David DeWitt, Jim Gray, 1992

Summary. Parallel databases have really taken hold for a number of reasons:

A general consensus on the architecture design.has formed: a shared-nothing architecture where each processor has its own disk & memory. Data can be partitioned from one input stream, processed, and then merged to one output stream.

Though there is still plenty of research to do, benchmark results seem to say that parallel databases are wave of the future.


More Detail.

Parallelism can be achieved in two ways: pipelined, moving already processed data to the next stage as it is available, and partitioned, dividing the data into partitions, processing them separately and then combining the output streams. Most query plans have little opportunity for pipelined parallelism.

Ideal parallelism can be measured in terms of linear speedup and linear scaleup. Speedup is the improvement gained in performance time by running a query of size X in parallel. Scaleup is the improvement gained in the amount of work that can be executed in some time T. Scaleup can be batch (the size of one large job that can be done in time T increases) or transactional (the number of small transactions that can be done in time T increases).

Barriers to achieving linear speedup:

Architecture. A taxonomy for categorizing || architectures:.

Data Partitioning. Three methods for data partitioning (I think these.strategies can be used to partition intermediate results as well as the ``permanent'' relations).

Partitioning can lead to data and execution skew. One strategy to avoid.skew is to look at the temperature of a tuple and attempt to make the aggregate heat of all partitions equal.

Partitioning leads to new database administration issues.

Parallelism within Relational Operators. How do we allow existing sequential operators (select,project,join) to execute in parallel? Since the operators expect the input in one stream and the output in one stream, use split tables to split the data into multiple streams. Merge the final output. Each processor will have input and output ports that the split/merged data can be directed from/to.

Problems Still to be Solved.