Automatic I/O Hint Generation through Speculative Execution

Fay Chang and Garth Gibson (CMU)

Summary. The authors present a speculative execution system for IO prefetching. They show that their system can decrease execution time nearly as well as manually hinted prefetching.


More detail

There has long existed a disparity between IO speed and processor speed such that the latter often stalls waiting for the former. High-performance disk systems have up to 10ms latencies -- millions of cycles. Asynchronous write-behind buffering helps to close the gap for IO writes but cannot help the read case since the values to be read are often needed immediately by the processor.

Prefetching background. "Prefetching, requesting data before it is needed in order to move it from a high-latency locale to a low-latency locale is a well-known technique for hiding read latency." A table condensing the evolution of prefetching described in the paper appears below.

Architecture. Introduce a speculating thread at low priority which runs during IO idle times. Correctness is ensured in spite of the speculating thread by catching signals, disallowing system calls which have undesirable side effects, and using software-enforced copy-on-write techniques. The speculative thread issues TIP hints (see the TIP overview below) and records the hints issued in a hint log. When the original thread is about to perform an IO read, it checks the hint log. If there is no next entry in the hint log, or if the next hint does not match the actual read, the speculative thread is deemed off track and is set back on track by setting its PC as if it had just returned from the current read.
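The on-/off-track check can be sketched as follows. This is a minimal illustration of the hint-log idea, not the paper's actual interface; the class and method names are hypothetical.

```python
# Sketch of the hint-log check the main thread performs at each IO read.
# The speculative thread appends the blocks it has hinted; the main thread
# consumes them in order and compares against its actual reads.
from collections import deque

class HintLog:
    def __init__(self):
        self.hints = deque()          # blocks hinted by the speculative thread
        self.off_track = False

    def record_hint(self, block):     # called by the speculative thread
        self.hints.append(block)

    def check_read(self, block):      # called by the main thread before a read
        """Return True if the speculative thread is still on track."""
        if not self.hints or self.hints.popleft() != block:
            self.hints.clear()
            self.off_track = True     # real system: reset the spec thread's PC
            return False
        return True

log = HintLog()
log.record_hint(7)
log.record_hint(8)
assert log.check_read(7)        # hint matches the actual read: on track
assert not log.check_read(99)   # mismatch: spec thread deemed off track
```

In the real system, an off-track verdict resets the speculative thread to resume as if it had just returned from the current read, rather than merely setting a flag.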

If the speculative thread strays too far from the execution path, it will not help and could hurt performance.

Correctness. Below is more detail on the three ways in which the speculating thread could change the app's behavior and how each is avoided.

  1. Change code or data values. Software-enforced copy-on-write: Inspired by SFI, they instrument the code with checks around loads and stores and add a data structure which maps original memory regions to copied memory regions. Before any store, the instrumented code checks whether the memory to be accessed has been copied and, if not, copies it and updates the mapping table.
  2. Produce side effects visible outside the code. Forbid the thread from issuing any system calls besides the TIP calls, fstat() and sbrk().
  3. Use inappropriate data values (divide by 0 or access an illegal address). Catch any signals thrown and halt execution of the thread until the next IO read.

Loads and stores comprise approximately 30% of the instruction mix. To lessen the cost of the load/store checks, they instrument a copy of the binary's text section for the speculative thread to run and leave the original, which the main thread runs, uninstrumented.
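The copy-on-write store check described above can be sketched as follows. This models memory as fixed-size pages with a Python dict standing in for the mapping table; it is an illustration of the technique, not the paper's implementation.

```python
# Sketch of a software-enforced copy-on-write store check. Speculative
# stores are redirected to private page copies via a mapping table, so the
# original pages (shared with the main thread) stay untouched.
PAGE = 4096

class SpecMemory:
    def __init__(self, pages):
        self.orig = pages                 # page number -> bytearray
        self.copies = {}                  # mapping table: page -> private copy

    def load(self, addr):                 # instrumented load
        page, off = divmod(addr, PAGE)
        src = self.copies.get(page, self.orig[page])
        return src[off]

    def store(self, addr, value):         # instrumented store
        page, off = divmod(addr, PAGE)
        if page not in self.copies:       # first write to this page: copy it
            self.copies[page] = bytearray(self.orig[page])
        self.copies[page][off] = value

mem = SpecMemory({0: bytearray(PAGE)})
mem.store(5, 42)
assert mem.load(5) == 42          # speculative thread sees its own write
assert mem.orig[0][5] == 0        # original memory is unmodified
```

The check-before-store pattern is why only the speculative copy of the text section is instrumented: the main thread never pays for these checks.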

They assume a system with strictly prioritized threads (otherwise the speculative thread could run at times other than when the main thread is blocked on IO). They also assume disk striping is in effect to increase disk bandwidth.

Performance. The performance closely matches that of manually modified applications. Impressive.

Additional overhead is introduced by the load/store checks, the copy-on-write copying, and the hint-log bookkeeping, but its negative impact does not overcome the performance gain.


Prefetching background.
This table condenses the evolution of prefetching described in the paper. Each solution addresses the problem to its left, and each problem arises from the solution on the row above.

  1. Problem: Prefetching requires that the IO system provide more bandwidth than the application requires.
     Solution: Striped, RAID disks.
  2. Problem: Inaccurate prefetching wastes IO bandwidth and cache space; accuracy is needed, but system parameters, some of them dynamic (cache size, access latency and bandwidth, load on the cache and IO system), make accurate prefetching difficult.
     Solution: Manually modify the application to issue hints for future reads rather than hard-coding asynchronous reads.
  3. Problem: Requires source code and involves formidable programming effort and knowledge.
     Solution: Automatic approaches. The most widespread is the sequential read-ahead performed by most OSes.
  4. Problem: Limited utility when files are small; does not help, and could hurt, apps with nonsequential access patterns.
     Solution: History-based prefetching: the OS gathers data on access patterns and uses it to infer future reads.
  5. Problem: Good for multiple-app access patterns (edit-compile-run), but a tradeoff exists between the amount of data gathered and the achievable resolution of prefetching decisions. The system must limit the amount of history kept by shortening the history cycle, tracking only the most frequently occurring events, or tracking only certain types of events. In choosing a policy, the system sacrifices its ability to predict the accesses of apps whose access patterns vary widely between runs or that are run infrequently.
     Solution: Compiler-driven (usually static) analysis.
  6. Problem: Good for loop-intensive, array-based apps; bad when IO is an "outer loop" activity with inner CPU activity separated by "many layers of abstraction (procedure calls and jump tables, for example)".
     Solution: Dynamic self-analysis, the approach of this paper.

TIP overview: an informed prefetching and caching manager from earlier CMU work by Hugo Patterson. To help alleviate contention for cache space, it separates access understanding from resource allocation. Applications issue informing hints, and an underlying system makes IO request decisions taking into account cache contention, the accuracy of previous hints, and the immediacy of the hint. TIP has a simple hint interface with three calls that nearly mirror the Unix file-block read calls. Applications make these calls to TIP, and TIP decides when/whether to do the prefetch.
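The division of labor in TIP can be sketched as follows. This is a hypothetical illustration of a hint interface where the application discloses a future read and the manager decides whether to prefetch; the names and the simple space-based policy are invented stand-ins for TIP's actual calls and cost/benefit logic.

```python
# Hypothetical TIP-style manager: applications disclose upcoming reads,
# and the manager decides whether to prefetch. A free-block counter stands
# in for TIP's cost/benefit weighing of cache contention.
class TipManager:
    BLOCK = 4096

    def __init__(self, cache_blocks):
        self.free = cache_blocks
        self.queued = []                       # prefetches actually issued

    def hint_read(self, fd, offset, length):   # hint mirrors a block read call
        """Disclose an upcoming read; prefetch only if cache space allows."""
        blocks = -(-length // self.BLOCK)      # ceiling division into blocks
        if self.free >= blocks:
            self.free -= blocks
            self.queued.append((fd, offset, length))
            return True                        # prefetch issued
        return False                           # hint recorded, not prefetched

tip = TipManager(cache_blocks=2)
assert tip.hint_read(3, 0, 4096)         # fits in cache: prefetch issued
assert tip.hint_read(3, 4096, 4096)
assert not tip.hint_read(3, 8192, 4096)  # cache full: prefetch deferred
```

The point of the separation is visible here: the application only states what it will read, and the manager alone decides when and whether disk requests are worth making.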


Questions / Comments