Summary. The authors present a speculative execution system for IO prefetching. They show that their system can decrease execution time nearly as well as manually-hinted prefetching.
Prefetching background. "Prefetching, requesting data before it is needed in order to move it from a high-latency locale to a low-latency locale, is a well-known technique for hiding read latency." A table condensing the evolution of prefetching described in the paper appears below.
Architecture. Introduce a speculating thread at low priority which runs during IO idle times. Correctness is ensured in spite of the speculating thread by catching signals, disallowing system calls which have undesirable side effects, and using software-enforced copy-on-write techniques. The speculative thread issues TIP (Transparent Informed Prefetching) hints and records the hints it issues in a hint log. When the original thread is about to perform an IO read, it checks the hint log. If there is no next entry in the hint log, or if the next hint is not correct, the speculative thread is deemed off-track and is put back on-track by setting its PC as if it had just returned from the current read.
If the speculative thread strays too far from the execution path, it will not help and could hurt performance.
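The hint-log check can be sketched as follows. This is a minimal illustration, not the paper's in-kernel implementation; all names (`HintLog`, `speculate`, `check`) are hypothetical.

```python
from collections import deque

class HintLog:
    """Sketch of the hint log shared between the main and speculative
    threads. The speculative thread appends hints as it issues them; the
    main thread consumes them just before each real read."""

    def __init__(self):
        self.hints = deque()
        self.off_track = False

    def speculate(self, block_id):
        # Speculative thread: issue a TIP-style hint for block_id
        # (prefetch omitted here) and record it in the log.
        self.off_track = False
        self.hints.append(block_id)

    def check(self, block_id):
        # Main thread, about to read block_id: compare against the next
        # logged hint. An empty log or a wrong hint means the speculative
        # thread is off-track; the real system would then reset the spec
        # thread's PC as if it had just returned from this read.
        if not self.hints or self.hints.popleft() != block_id:
            self.hints.clear()
            self.off_track = True
        return not self.off_track
```

A correct hint stream keeps the speculative thread on-track; the first mismatch invalidates the remaining hints, mirroring the paper's resynchronization step.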
Correctness. Below is more detail on the 3 ways in which the speculating thread could change the app's behavior and how they are avoided.
Loads and stores comprise approximately 30% of the instruction mix. To lessen the cost of the load/store checks, they make a copy of the binary text section, instrument only the copy (which the speculative thread executes), and leave the original uninstrumented.
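The software-enforced copy-on-write idea behind the instrumented loads and stores can be sketched as below. This models memory as fixed-size pages and is purely illustrative; the class and method names are assumptions, not the paper's instrumentation.

```python
PAGE = 4096  # illustrative page size

class SpecMemory:
    """Sketch of software-enforced copy-on-write for the speculative
    thread. Instrumented stores copy a page on first write so the main
    thread never observes speculative modifications; instrumented loads
    prefer the speculative thread's private copy when one exists."""

    def __init__(self, shared_pages):
        self.shared = shared_pages   # pages visible to the main thread
        self.private = {}            # spec thread's private page copies

    def load(self, addr):
        # Instrumented load: read from the private copy if this page
        # has already been written speculatively.
        page = addr // PAGE
        src = self.private.get(page, self.shared[page])
        return src[addr % PAGE]

    def store(self, addr, value):
        # Instrumented store: copy-on-write the page before the first
        # speculative modification, then write only to the copy.
        page = addr // PAGE
        if page not in self.private:
            self.private[page] = bytearray(self.shared[page])
        self.private[page][addr % PAGE] = value
```

Because only the speculative thread runs the instrumented copy of the text section, the main thread pays none of this per-load/store overhead.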
They assume a system with strictly prioritized threads (otherwise the speculative thread could run at times other than when the main thread is blocked on IO). They also assume that disk striping is in effect to increase disk bandwidth.
Performance. The performance closely matches that of manually modified applications. Impressive.
Additional overhead is introduced, though its negative impact does not outweigh the performance gain.
| Problem | Solution |
| --- | --- |
| Requires that the IO system provide more bandwidth than the application requires. | Striped, RAID disks. |
| Inaccurate prefetching wastes IO bandwidth and cache space. | Need accuracy. |
| System parameters, some of them dynamic, affect prefetching accuracy and make it difficult to achieve (cache size, access latency and bandwidth, load on cache and IO system). | Manually modify the application to issue hints for future reads rather than hard-coding asynchronous reads. |
| Needs source code and involves formidable programming effort/knowledge. | Automatic approaches. Most widespread is sequential read-ahead, performed by most OSes. |
| Limited utility when files are small; does not help and could hurt apps with nonsequential access patterns. | History-based prefetching: the OS gathers data on access patterns and uses it to infer future reads. |
| Good for multiple-app access patterns (edit-compile-run), but a tradeoff exists between the amount of data gathered and the achievable resolution in prefetching decisions. The system must limit the amount of history taken, by shortening the history cycle, tracking only the most frequently occurring events, or tracking only certain types of events. In choosing a policy, the system sacrifices its ability to predict accesses of apps whose access patterns vary widely between runs or that run infrequently. | Compiler-driven (usually static) analysis. |
| Good for loop-intensive, array-based apps; bad when IO is an "outer loop" activity with inner CPU activity separated by "many layers of abstraction (procedure calls and jump tables, for example)". | Dynamic self-analysis. |