Lazy Receiver Processing (LRP): A Network Subsystem
Architecture for Server Systems
Peter Druschel and Gaurav Banga (Rice), 1996
Summary.
Network servers can experience slowdown and even livelock
under overload. The authors present a network subsystem architecture
that alleviates (or, if the NI is programmable, eliminates) these
overload problems via a combination of early-demultiplexing and
lazy receiver processing.
More Detail
The proposed architecture depends upon both early demux and LRP
to provide increased throughput under increased load. Below is a
description of how traditional network subsystems give rise to
pathological conditions, including
- livelock: the network subsystem is so overwhelmed by incoming packets
that the application never gets a chance to process them; the socket
queues and, eventually, the IP queue fill and packets are dropped;
- unfair resource accounting.
Aspects of traditional systems that give rise to problems include (note:
this dissection of the problems/solutions isn't perfect but conveys the main
points of the paper):
- Eager receiver processing:
- Description: Capture and storage of packets is hardware-interrupt
driven and given the highest priority; second priority is given
to protocol processing; lowest priority is given to the applications
processing the packets. In other words, the early stages of receiver
processing have strictly higher priority than later stages.
- Problem: Under load, can lead to
- livelock: since the application never gets a chance to empty its queues,
the socket and, eventually, the IP queues can fill, leading to packet drops.
- priority inversion: a low-priority receiver's packets are processed at
high (interrupt) priority, stealing cycles from higher-priority applications.
- incorrect accounting: the running process is charged for processing
time regardless of whether it owns the packet.
- Policy: Restrict protocol processing to the receiver's context, and
only when the receiver is blocked on a socket read.
- Mechanism: Early demux to a per-socket queue; signal the condition
variable on which the receiver is (possibly) waiting.
- Lack of effective load shedding:
- Description: Packets are dropped only after significant resources
have been invested.
- Problem:
- inefficient use of resources.
- Policy: Shed packets early.
- Mechanism: Early demux to per-communication-endpoint queues; when
these fill, the host indicates to the NI that further packets for that
endpoint should be dropped.
- Lack of traffic separation:
- Description: Incoming traffic for one receiver can affect other
receivers.
- Problem:
- A single backlogged app's full socket queue can lead to a full IP
queue; packets for all applications are then dropped.
- Policy: Separate traffic such that apps cannot interfere
with one another.
- Mechanism: Early demux to per-socket queue to eliminate
packet drop in other apps due to full queues; restrict packet
processing to receiver's context at receiver's priority to prevent
priority inversion.
- Inappropriate resource accounting:
- Description: Packet processing time charged to process
scheduled at packet arrival, not necessarily owner.
- Problem:
- Can cause priority inversion.
- A CPU-bound process's priority can be unfairly decreased if it is scheduled when
packets arrive.
- Policy: Charge owner processes for packet processing.
- Mechanism: Do early demux to find correct owner.
Restrict packet processing to receiver's context
at receiver's priority.
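The mechanisms above can be combined in a toy sketch. This is my own user-level simulation with invented names, not the paper's code (the real system lives in the kernel): early demux to per-socket queues, early shedding when a queue fills, and lazy protocol processing charged to the receiver.

```python
from collections import deque

class Socket:
    """Per-endpoint state: a raw-packet queue plus a CPU-time charge."""
    def __init__(self, capacity=4):
        self.raw = deque()        # demuxed but unprocessed packets
        self.capacity = capacity
        self.charge = 0           # protocol-processing cost billed here
        self.dropped = 0          # shed early, before any protocol work

sockets = {}

def demux(port, payload):
    """Early demultiplex: header inspection only. Packets for a full (or
    unknown) endpoint are shed immediately, at near-zero cost, and only
    that endpoint's traffic suffers."""
    sock = sockets.get(port)
    if sock is None:
        return False
    if len(sock.raw) >= sock.capacity:
        sock.dropped += 1
        return False
    sock.raw.append(payload)
    return True

def recv(port, cost=1):
    """Lazy receiver processing: protocol work runs here, in the
    receiver's context, and is charged to the owning endpoint."""
    sock = sockets[port]
    if not sock.raw:
        return None               # the receiver would block here
    sock.charge += cost
    return sock.raw.popleft()     # stand-in for actual IP/UDP processing

sockets[53] = Socket()
sockets[80] = Socket()
for _ in range(10):               # port 80 is flooded and never drained
    demux(80, b"flood")
demux(53, b"query")               # port 53's traffic is unaffected
print(recv(53), sockets[80].dropped)   # → b'query' 6
```

Note how the flood fills only port 80's queue: the excess is counted as dropped there, while port 53's packet is delivered and its processing cost billed to port 53's endpoint.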
TCP packets whose receivers are not blocked are processed by a
special per-application thread. The priority of this special thread
reflects the application process's priority. Timely processing of TCP
packets is required for efficiency (to keep the pipe full).
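A minimal sketch of that special-thread idea (names and structure are mine; in the real system the thread runs in the kernel at the owning process's priority): segments for a non-blocked receiver are handed to a worker thread so ACKs go out promptly.

```python
import threading
import queue

work = queue.Queue()              # segments awaiting protocol processing
acked = []

def protocol_thread():
    """Stand-in for the per-application TCP processing thread; in the
    paper this runs at the owning application's priority."""
    while True:
        seg = work.get()
        if seg is None:           # sentinel: shut the worker down
            break
        acked.append(seg)         # stand-in for TCP processing + ACK

t = threading.Thread(target=protocol_thread)
t.start()
work.put("seg1")
work.put("seg2")
work.put(None)
t.join()
print(acked)                      # → ['seg1', 'seg2']
```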
The send side of communication in a traditional system isn't as unfair:
the processing of a packet to be sent is done in the context of the
process that called send(), up until the data is copied to the NI buffer.
Packets queued in the NI queue are removed and transmitted in the context
of the NI interrupt handler.
Regarding flow control: flow control (window-size regulation) and
congestion control (multiplicative decrease) in TCP regulate only traffic
on already-established connections. Network servers are still vulnerable
to SYN packet attacks.
Firewalls establish a new TCP connection for each flow that passes through
them.
LRP trivially depends on early demux: to know a packet's owner, its
priority, and when to schedule its processing, the subsystem must
demultiplex early.
The authors implement NI demux in the network interface using firmware
developed in Cornell's U-Net project, and soft demux in the network
driver's interrupt handler.
If a fragment of an IP packet arrives before the first fragment (which
carries the transport header), the fragment is put on a special NI queue.
This is the uncommon case.
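A hedged sketch of this fragment fallback (my own simplification, not the paper's code): a later fragment arriving first can't be demultiplexed yet, so it's parked on the special queue.

```python
special = []                      # stand-in for the special NI queue

def demux_fragment(frag, port_of_id):
    """port_of_id maps datagram IDs whose first fragment has been seen
    to their destination port; returns the port, or None if unknown."""
    if frag["offset"] == 0:       # first fragment: transport header present
        port_of_id[frag["id"]] = frag["port"]
        return frag["port"]
    if frag["id"] in port_of_id:  # first fragment already seen
        return port_of_id[frag["id"]]
    special.append(frag)          # can't demux yet: park it
    return None

seen = {}
print(demux_fragment({"id": 7, "offset": 512}, seen))            # → None
print(demux_fragment({"id": 7, "offset": 0, "port": 53}, seen))  # → 53
```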
LRP increases UDP delay only if the CPU is idle and the receiving process
is blocked on disk access (or something similar) when the packet arrives.
To solve this problem, the idle loop checks for packet receipt. I believe
the added time is just the packet-processing time that would have happened
right away in a traditional subsystem (see graph in paper margin).
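A sketch of the idle-loop fix (my construction, with invented names): LRP would otherwise defer protocol processing until the next receive call, so the idle loop drains the deferred work whenever the CPU has nothing better to do.

```python
from collections import deque

deferred = deque([b"pkt1", b"pkt2"])  # demuxed but unprocessed packets
ready = []

def idle_loop():
    """Stand-in for the kernel idle loop: perform deferred protocol
    processing while the CPU would otherwise sit idle."""
    while deferred:
        ready.append(deferred.popleft())  # stand-in for protocol work

idle_loop()
print(ready)   # → [b'pkt1', b'pkt2'], processed before the receiver wakes
```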
ARP, RARP, ICMP, and IP forwarding are charged to special daemon processes
that act as proxies for a particular protocol.
Performed a number of experiments. Showed their architecture
- didn't increase latency significantly; throughput increases
for UDP (5% and 12% for soft/NI demux) and decreases only slightly for TCP
(3% and 4% for NI/soft).
- (measured throughput as a function of offered rate) handled increasing
load much better than traditional BSD; NI-LRP's throughput didn't decrease
under load, and soft-LRP postponed the effects of overload significantly
(at 20K pkts/sec it was within 12% of BSD's maximum throughput).
- (measured RTT as a function of background traffic rate) was effective in
separating traffic, such that background traffic didn't increase the
latency of a server process.
- (measured server request elapsed time and request throughput with three
processes: one long and compute-bound, and two short) yielded lower latency
for the compute-bound process when the other, I/O-bound processes were
present (showing that the processes were effectively separated, in that
packet processing didn't harm the compute-bound process), and LRP yielded
higher throughput (showing that the LRP scheme reduced context-switch
overhead and had better locality due to executing in the receiver's
context).
- (measured HTTP transfers/second as a function of SYN packets/sec)
withstood a SYN packet attack (similar to the previous experiment, but a
more realistic scenario); at 20,000 SYN packets/sec, HTTP transfers were at
50% of the maximum; BSD experienced livelock at 10,000 packets/sec.
The performance section is worth rereading.