IO-Lite: A Unified I/O Buffering and Caching System

Vivek S. Pai, Peter Druschel and Willy Zwaenepoel (Rice)

Summary. The insight behind this paper is that by making buffers immutable (read-only), one can share them among the file cache, network buffers, and applications and thus eliminate all unnecessary data copying (some copying of incoming network data is still required by hardware alignment constraints). They accomplish this by writing modified data into a new, now immutable buffer and adding a level of abstraction called a buffer aggregate to keep track of the resulting buffer chains. Experiments with a web server show 40-80% improvements on real workloads.

More Detail

The lack of integration between the IO systems (disk, network) and applications causes repeated data copying and double-buffering, which leads to slowdown and memory misuse (memory that could serve as, say, web cache instead holds redundant copies of data). Further, separate buffering mechanisms make it difficult to implement cross-domain optimizations such as TCP checksum caching and specialized cache replacement policies.

IO-Lite eliminates redundant data and double-buffering and, as a side-effect, supports cross-domain optimizations. IO-Lite deems all buffers immutable. That is, once they are initially written, they can no longer be modified. If a process needs to modify the data held in an immutable buffer, it allocates a second buffer large enough to hold the changes and writes the modified data there. A buffer aggregate consists of (ptr,len) pairs that keep track of the modifications. Buffer aggregates are mutable.

Deeming buffers immutable eliminates problems of consistency, protection, synchronization, and fault isolation. Aggregates are passed by value, but buffers are passed by reference and the pages associated with the buffers are made readable to the receiving domain. In this way, subsystems can share the same data.
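To make the buffer/aggregate split concrete, here is a minimal sketch in Python. It is not IO-Lite's actual API: the names Slice, Aggregate, replace and copy are hypothetical, Python bytes objects stand in for immutable buffers, and an integer offset stands in for the pointer in each (ptr,len) pair.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slice:
    """One (ptr, len) pair: a view into an immutable buffer."""
    buf: bytes    # immutable buffer, shared by reference
    offset: int   # stands in for "ptr"
    length: int   # "len"


class Aggregate:
    """Mutable list of slices; the buffers behind it never change."""

    def __init__(self, slices: List[Slice]):
        self.slices = list(slices)

    @classmethod
    def from_bytes(cls, data: bytes) -> "Aggregate":
        return cls([Slice(data, 0, len(data))])

    def copy(self) -> "Aggregate":
        # Pass-by-value: the slice list is copied, the buffers are not.
        return Aggregate(self.slices)

    def to_bytes(self) -> bytes:
        # Materializes the logical contents (this copies; inspection only).
        return b"".join(s.buf[s.offset:s.offset + s.length] for s in self.slices)

    def replace(self, start: int, end: int, new_data: bytes) -> None:
        """Replace logical bytes [start, end) with new_data, in place.

        Only new_data lives in a fresh buffer; untouched ranges keep
        pointing at the original buffers.
        """
        out: List[Slice] = []
        pos = 0
        inserted = False
        for s in self.slices:
            s_start, s_end = pos, pos + s.length
            if s_start < start:  # part of this slice before the edit
                out.append(Slice(s.buf, s.offset, min(s_end, start) - s_start))
            if not inserted and s_end >= start:  # reached the edit point
                out.append(Slice(new_data, 0, len(new_data)))
                inserted = True
            if s_end > end:  # part of this slice after the edit
                skip = max(end, s_start) - s_start
                out.append(Slice(s.buf, s.offset + skip, s.length - skip))
            pos = s_end
        if not inserted:  # edit point lies past the end: append
            out.append(Slice(new_data, 0, len(new_data)))
        self.slices = out
```

For example, after agg = Aggregate.from_bytes(b"hello world") and agg.replace(6, 11, b"there"), the aggregate holds two slices: the first still points into the original buffer and the second into a new buffer holding only the changed bytes. A copy taken before the edit is unaffected.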

There is a relatively simple API to which applications and IO subsystems must be ported in order to use IO-Lite. The API is very similar to read and write, and porting appears trivial.

Access control and allocation. An ACL is associated with each buffer (a buffer can span several pages). Each ACL is associated with a pool of pages from which buffers can be made. To eliminate unnecessary copying, incoming network data is demultiplexed early using packet filters, which find the correct ACL so that buffers can be allocated from the correct buffer pool.
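A toy sketch of the idea, assuming a packet filter that matches only on destination port and an ACL modeled as a set of process names. The class BufferPools and its methods are hypothetical and exist only for illustration; what a real system does with a packet that matches no filter is outside this sketch.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

ACL = FrozenSet[str]  # assumption: an ACL is the set of processes allowed to read


class BufferPools:
    def __init__(self) -> None:
        self.pools: Dict[ACL, List[bytes]] = defaultdict(list)  # one pool per ACL
        self.filters: Dict[int, ACL] = {}                       # dst port -> ACL

    def register_filter(self, port: int, readers: Iterable[str]) -> None:
        self.filters[port] = frozenset(readers)

    def receive(self, port: int, payload: bytes) -> Optional[Tuple[ACL, bytes]]:
        """Demultiplex first, then allocate from the pool of the matching ACL."""
        acl = self.filters.get(port)
        if acl is None:
            return None  # no filter matched; not modeled here
        buf = bytes(payload)          # the buffer is immutable from here on
        self.pools[acl].append(buf)   # it lives in the pool its readers may map
        return acl, buf
```

Because the ACL is known before the data is placed in memory, the buffer already sits in pages the receiving processes are allowed to read, so no later copy into "their" memory is needed.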

Cache replacement and paging. Unified buffer/cache systems must take into account both:

VM references
File read/write

when choosing a victim. IO-Lite punts on VM refs and uses only file references, but for their app, this might not make much difference since the server doesn't process the data in the pages -- only writes it to a socket.

The page chosen to swap out is, in order of preference:

The least recently used non-referenced page.
If there is none, the least recently used referenced page.
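The two-level choice can be sketched as follows. This is only an illustration: the function name choose_victim and the dictionary-based bookkeeping are assumptions, and timestamps are plain integers where smaller means older.

```python
from typing import Dict, Optional


def choose_victim(last_use: Dict[str, int], referenced: Dict[str, bool]) -> Optional[str]:
    """Pick the LRU non-referenced entry; fall back to the LRU referenced one."""
    unreferenced = [e for e in last_use if not referenced.get(e, False)]
    candidates = unreferenced if unreferenced else list(last_use)
    if not candidates:
        return None  # nothing cached
    return min(candidates, key=lambda e: last_use[e])
```

With last_use = {"a": 1, "b": 2, "c": 3} and only "a" referenced, the function skips "a" even though it is oldest and returns "b".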

A file cache entry is evicted when, since the last cache entry eviction, more than half of the VM pages chosen for replacement were pages holding cached IO data. In that case it is assumed that the IO cache is too large.
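A literal reading of this trigger can be sketched with two counters that reset at each eviction. The class name EvictionTrigger and its method are hypothetical, and checking the ratio on every page replacement is an assumption about when the test runs.

```python
class EvictionTrigger:
    """Counts VM page replacements since the last cache entry eviction."""

    def __init__(self) -> None:
        self.total = 0     # VM pages chosen for replacement
        self.io_pages = 0  # of those, pages that held cached IO data

    def page_replaced(self, held_io_data: bool) -> bool:
        """Record one replacement; return True if a cache entry should be evicted."""
        self.total += 1
        if held_io_data:
            self.io_pages += 1
        if self.io_pages * 2 > self.total:  # strictly more than half
            self.total = 0                  # a new period starts at the eviction
            self.io_pages = 0
            return True
        return False
```

Feeding it the sequence non-IO, non-IO, IO, IO, IO triggers an eviction only on the fifth page, when three of five replaced pages held IO data.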

IO-Lite and mmap. For some applications, write-in-place is necessary for good performance. In this case, IO-Lite supports the mmap interface for creating a contiguous memory mapping for an IO object. In many cases, copying is not necessary. However, if the data is coming off the network and is not aligned, the kernel must copy it. A copy must also be made if the data is referenced through an IO-Lite buffer. In this case, the data is copied lazily, page by page.

Cross-subsystem Optimizations. They modified the network stack to check in a cache for already-calculated TCP checksums. They add a generation number onto each buffer to make sure the checksum was calculated for the right buffer.
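A minimal sketch of such a cache, assuming the generation number changes whenever a buffer is reallocated with new contents, so that the (buffer id, generation) pair identifies one fixed piece of data. The class ChecksumCache and the integer buffer ids are hypothetical; the checksum itself is the standard 16-bit one's-complement Internet checksum.

```python
from typing import Dict, Tuple


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit big-endian words, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


class ChecksumCache:
    def __init__(self) -> None:
        self.cache: Dict[Tuple[int, int], int] = {}
        self.computed = 0  # how many checksums were actually calculated

    def checksum(self, buf_id: int, generation: int, data: bytes) -> int:
        key = (buf_id, generation)
        if key not in self.cache:
            self.cache[key] = internet_checksum(data)
            self.computed += 1
        return self.cache[key]
```

Sending the same buffer twice computes the checksum once. If the buffer's memory is later reused for different data, the new generation number produces a different key, so a stale checksum is never returned.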

Implementation. They implemented IO-Lite on a BSD system and modified the Flash web server to use it. They ran experiments comparing Flash-lite's performance against Flash and Apache. When fetching a single document, Flash-lite did 43% better than Flash (it eliminates the copy between the file cache and socket buffer) and 137% better than Apache. It did even better for CGI programs because it eliminates the data copy between the CGI process and the server daemon process. Results showed that with CGI, Flash-lite retained 87% of the bandwidth it achieved for static content; Apache and Flash retained approximately 50%. They concluded that with IO-Lite, there may be less reason to resort to library-based interfaces (dynamically linked third-party code, which sacrifices fault isolation and protection) for dynamic content generation.

Ran an experiment using a real workload to show benefits from better memory utilization (bigger realized web cache) and custom cache replacement algorithm. Did well again.

WAN. Since Apache and Flash both use mmap to read files, the primary remaining source of double buffering is the TCP buffer. The amount of data associated with TCP buffers is the number of concurrent open connections multiplied by the send-buffer size. They noted that with future increases in internet bandwidth, the maximum size of the buffer is likely to increase. Also, once the mbuf pool has grown to a certain size, it does not decrease. This makes it especially important to eliminate double-buffering of TCP buffers.
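As a worked example of that product, the snippet below computes the worst-case memory tied up in send buffers. The figures used (1000 connections, 64 KB send buffers) are made up for illustration and are not taken from the paper.

```python
def tcp_send_buffer_bytes(connections: int, send_buffer_bytes: int) -> int:
    """Upper bound on data held in TCP send buffers: connections x buffer size."""
    return connections * send_buffer_bytes


# Hypothetical load: 1000 slow clients, each with a full 64 KB send buffer.
total = tcp_send_buffer_bytes(1000, 64 * 1024)
print(total / (1024 * 1024), "MiB")  # 62.5 MiB
```

With copy-based sockets, this memory holds duplicates of data that is also in the file cache, so it comes straight out of the space available for caching documents.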

To quantify savings in socket double-buffering, ran an experiment where "slow" clients consumed data slowly, forcing the server to buffer TCP data (interesting technique). Flash-lite did not suffer as the number of clients increased; Apache and Flash did -- 42% and 30% throughput loss, respectively.

Questions / Comments

The "impact" section doesn't count additional memory used. This seems to be a huge flaw. I understand how empty, non-referenced pages are reused, but what about fragmentation within a page? How much overhead does that cause?
What's up with presentation in these OSDI 99 papers?
Didn't reference Stonebraker's "OSes Suck" paper from 1981 which talked about, among other things, the lack of support for application-level control of buffering and caching.
Would like "slightly modified" to be quantified. Does it take two hours, two days or two weeks to "slightly modify" a kernel?
Is it valid to ignore VM references in page replacement policy? Can we come up with a workload where this yields bad performance? Web servers are unique in that they do not pull into memory and modify their data.