Lazy Receiver Processing (LRP): A Network Subsystem
Architecture for Server Systems
Peter Druschel and Gaurav Banga (Rice), 1996
Summary.
Network servers can experience slowdown and even livelock
under overload. The authors present a network subsystem architecture
that alleviates (or, if the NI is programmable, eliminates) these
overload problems via a combination of early-demultiplexing and
lazy receiver processing.
More Detail
The proposed architecture depends upon both early demux and LRP
to provide increased throughput under increased load. Below is a
description of how traditional network subsystems give rise to
pathological conditions, including
- livelock: the network subsystem is so overwhelmed by incoming packets
that the application never gets a chance to process them; the socket
queues and, eventually, the IP queue fill and packets are dropped;
- unfair resource accounting.
Aspects of traditional systems that give rise to problems include (note:
this dissection of the problems/solutions isn't perfect but conveys the main
points of the paper):
- Eager receiver processing:
- Description: Capture and storage of packets is hardware-interrupt
driven and given the highest priority; second priority is given
to protocol processing; lowest priority is given to the applications
processing the packets. In other words, the early stages of receiver
processing have strictly higher priority than later stages.
- Problem: Under load, can lead to
- livelock: since the application never gets a chance to empty its queues,
the socket and, eventually, the IP queues can fill, leading to packet drops.
- priority inversion: a low-priority receiver's packets are processed at
high (interrupt) priority, stealing cycles from higher-priority applications.
- incorrect accounting: the running process is charged for processing
time regardless of whether it owns the packet.
- Policy: Restrict protocol processing to the receiver's context, and
only when the receiver is blocked on a socket read.
- Mechanism: Early demux to a per-socket queue; signal the condition
variable on which the receiver is (possibly) waiting.
- Lack of effective load shedding:
- Description: Packets are dropped only after significant resources
have been invested.
- Problem:
- inefficient use of resources.
- Policy: Shed packets early.
- Mechanism: Early demux to per-communication-endpoint queues; when
these fill, the host indicates to the NI that further packets for that
endpoint should be dropped.
- Lack of traffic separation:
- Description: Incoming traffic for one receiver can affect other
receivers.
- Problem:
- A single backlogged app's full socket queue can lead to a full IP
queue; packets for all applications are then dropped.
- Policy: Separate traffic such that apps cannot interfere
with one another.
- Mechanism: Early demux to per-socket queue to eliminate
packet drop in other apps due to full queues; restrict packet
processing to receiver's context at receiver's priority to prevent
priority inversion.
- Inappropriate resource accounting:
- Description: Packet processing time charged to process
scheduled at packet arrival, not necessarily owner.
- Problem:
- Can cause priority inversion.
- A CPU-bound process's priority can be unfairly decreased if it is scheduled when
packets arrive.
- Policy: Charge owner processes for packet processing.
- Mechanism: Do early demux to find correct owner.
Restrict packet processing to receiver's context
at receiver's priority.
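The mechanisms above can be combined in a toy sketch. This is my own user-level simulation with invented names, not the paper's code (the real system lives in the kernel): early demux to per-socket queues, early shedding when a queue fills, and lazy protocol processing charged to the receiver.

```python
from collections import deque

class Socket:
    """Per-endpoint state: a raw-packet queue plus a CPU-time charge."""
    def __init__(self, capacity=4):
        self.raw = deque()        # demuxed but unprocessed packets
        self.capacity = capacity
        self.charge = 0           # protocol-processing cost billed here
        self.dropped = 0          # shed early, before any protocol work

sockets = {}

def demux(port, payload):
    """Early demultiplex: header inspection only. Packets for a full (or
    unknown) endpoint are shed immediately, at near-zero cost, and only
    that endpoint's traffic suffers."""
    sock = sockets.get(port)
    if sock is None:
        return False
    if len(sock.raw) >= sock.capacity:
        sock.dropped += 1
        return False
    sock.raw.append(payload)
    return True

def recv(port, cost=1):
    """Lazy receiver processing: protocol work runs here, in the
    receiver's context, and is charged to the owning endpoint."""
    sock = sockets[port]
    if not sock.raw:
        return None               # the receiver would block here
    sock.charge += cost
    return sock.raw.popleft()     # stand-in for actual IP/UDP processing

sockets[53] = Socket()
sockets[80] = Socket()
for _ in range(10):               # port 80 is flooded and never drained
    demux(80, b"flood")
demux(53, b"query")               # port 53's traffic is unaffected
print(recv(53), sockets[80].dropped)   # → b'query' 6
```

Note how the flood fills only port 80's queue: the excess is counted as dropped there, while port 53's packet is delivered and its processing cost billed to port 53's endpoint.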
TCP packets whose receivers are not blocked are processed by a
special per-application thread. The priority of this special thread
reflects the application process's priority. Timely processing of TCP
packets is required for efficiency (to keep the pipe full).
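A minimal sketch of that special-thread idea (names and structure are mine; in the real system the thread runs in the kernel at the owning process's priority): segments for a non-blocked receiver are handed to a worker thread so ACKs go out promptly.

```python
import threading
import queue

work = queue.Queue()              # segments awaiting protocol processing
acked = []

def protocol_thread():
    """Stand-in for the per-application TCP processing thread; in the
    paper this runs at the owning application's priority."""
    while True:
        seg = work.get()
        if seg is None:           # sentinel: shut the worker down
            break
        acked.append(seg)         # stand-in for TCP processing + ACK

t = threading.Thread(target=protocol_thread)
t.start()
work.put("seg1")
work.put("seg2")
work.put(None)
t.join()
print(acked)                      # → ['seg1', 'seg2']
```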
The send side of communication in a traditional system isn't as unfair:
the processing of a packet to be sent is done in the context of the
process that called send(), up until the data is copied to the NI buffer.
Packets queued in the NI queue are removed and transmitted in the context
of the NI interrupt handler.
Regarding flow control: flow control (window-size regulation) and
congestion control (multiplicative decrease) in TCP regulate only traffic
on already-established connections. Network servers are still vulnerable
to SYN packet attacks.
Firewalls establish a new TCP connection for each flow that passes through
them.
LRP trivially depends on early demux: to know a packet's owner, its
priority, and when to schedule its processing, the subsystem must
demultiplex early.
The authors implement NI demux in the network interface using firmware
developed in Cornell's U-Net project, and soft demux in the network
driver's interrupt handler.
If a fragment of an IP packet arrives before the first fragment (which
carries the transport header), the fragment is put on a special NI queue.
This is the uncommon case.
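A hedged sketch of this fragment fallback (my own simplification, not the paper's code): a later fragment arriving first can't be demultiplexed yet, so it's parked on the special queue.

```python
special = []                      # stand-in for the special NI queue

def demux_fragment(frag, port_of_id):
    """port_of_id maps datagram IDs whose first fragment has been seen
    to their destination port; returns the port, or None if unknown."""
    if frag["offset"] == 0:       # first fragment: transport header present
        port_of_id[frag["id"]] = frag["port"]
        return frag["port"]
    if frag["id"] in port_of_id:  # first fragment already seen
        return port_of_id[frag["id"]]
    special.append(frag)          # can't demux yet: park it
    return None

seen = {}
print(demux_fragment({"id": 7, "offset": 512}, seen))            # → None
print(demux_fragment({"id": 7, "offset": 0, "port": 53}, seen))  # → 53
```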
LRP increases UDP delay only if the CPU is idle and the receiving process
is blocked on disk access (or something similar) when the packet arrives.
To solve this problem, the idle loop checks for packet receipt. I believe
the added time is just the packet-processing time that would have happened
right away in a traditional subsystem (see graph in paper margin).
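A sketch of the idle-loop fix (my construction, with invented names): LRP would otherwise defer protocol processing until the next receive call, so the idle loop drains the deferred work whenever the CPU has nothing better to do.

```python
from collections import deque

deferred = deque([b"pkt1", b"pkt2"])  # demuxed but unprocessed packets
ready = []

def idle_loop():
    """Stand-in for the kernel idle loop: perform deferred protocol
    processing while the CPU would otherwise sit idle."""
    while deferred:
        ready.append(deferred.popleft())  # stand-in for protocol work

idle_loop()
print(ready)   # → [b'pkt1', b'pkt2'], processed before the receiver wakes
```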
ARP, RARP, ICMP, and IP forwarding are charged to special daemon processes
that act as proxies for a particular protocol.
Performed a number of experiments. Showed their architecture
- didn't increase latency significantly; throughput increases
for UDP (5% and 12% for soft/NI demux) and decreases only slightly for TCP
(3% and 4% for NI/soft).
- (measured throughput as a function of offered rate) handled increasing
load much better than traditional BSD; NI-LRP's throughput didn't decrease
under load, and soft-LRP postponed the effects of overload significantly
(at 20K pkts/sec it was within 12% of BSD's maximum throughput).
- (measured RTT as a function of background traffic rate) was effective in
separating traffic, such that background traffic didn't increase the
latency of a server process.
- (measured server request elapsed time and request throughput with three
processes: one long and compute-bound, and two short) yielded lower latency
for the compute-bound process when the other, I/O-bound processes were
present (showing that the processes were effectively separated, in that
packet processing didn't harm the compute-bound process), and LRP yielded
higher throughput (showing that the LRP scheme reduced context-switch
overhead and had better locality due to executing in the receiver's
context).
- (measured HTTP transfers/second as a function of SYN packets/sec)
withstood a SYN packet attack (similar to the previous experiment, but a
more realistic scenario); at 20,000 SYN packets/sec, HTTP transfers were at
50% of the maximum; BSD experienced livelock at 10,000 packets/sec.
The performance section is worth rereading.