Efficient Packet Demultiplexing for Multiple Endpoints and Large Messages

M. Yuhara (Fujitsu), B. Bershad (UW), C. Maeda (CMU), J. Moss (UMass), 1994

Summary. The authors found that, in the context of a microkernel, popular packet filters were lacking in two ways: they did not scale with the number of installed filters, and they did not adequately support fragmented packets. They extended the BPF virtual machine with instructions that 1) collapse filters for identical transport protocols into a single filter and 2) allow state (such as a UDP message ID) to be associated with a filter to support demultiplexing of fragments.


More Detail

Problem: To support network subsystems implemented at user level in the Mach microkernel, the authors used BPF to demultiplex packets to the user-level network subsystem module. In that setting, the packet filter needed neither to scale (there were commonly only two network modules to which to demultiplex) nor to support fragmentation (the user-level network module handled reassembly). Unfortunately, this architecture introduced a bottleneck at the single TCP/IP protocol module, so the authors moved the TCP/IP stack into each user program that required it. This eliminated the single-network-module bottleneck but introduced a scalability problem: hundreds of applications could require network support, forcing the packet filter mechanism to process hundreds of filters. Furthermore, fragments had to be sent to a higher-level intermediary process to await the arrival of their peer fragments, which required additional address-space crossings.

The authors solved the scalability and fragmentation problems by extending the existing BPF instruction set. They recognized that most packets belong to a few transport protocols. If the virtual machine could recognize filters for the same transport protocol, it could collapse all the common filter tests into one and then incrementally perform the distinct tests. They accomplished this collapsing with the introduction of the ret_match_imm instruction:

	ret_match_imm #3, #ALL
	key #key0
	key #key1
	key #key2
	
The first argument indicates the number of data items that follow to be compared. The subsequent key pseudo-instructions provide the immediate data, which is compared with data stored in the scratch memory registers (see the BPF paper for details on the VM architecture). The resulting program has two sections: one that performs the tests common to a transport protocol (e.g., ip-proto, tcp-proto, dest-ip-addr, dest-tcp-port) and one that performs the tests specific to an endpoint (src-ip-addr, src-tcp-port, dest-tcp-port).
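The two-section structure can be sketched in a few lines of Python. This is a hypothetical illustration, not MPF's actual implementation: the packet fields, endpoint names, and table layout are invented, and the key-matching step is modeled as a dictionary lookup standing in for the keys that follow ret_match_imm.

```python
# Sketch of a collapsed TCP/IP filter: the common section runs once for
# all endpoints sharing the protocol; the endpoint-specific section is a
# single hash lookup on the distinct values, replacing N linear filters.

IPPROTO_TCP = 6

# One entry per installed filter that shares the common TCP/IP prefix.
# (src_ip, src_port, dst_port) -> endpoint; names are illustrative.
endpoints = {
    ("10.0.0.2", 42000, 80): "endpoint-A",
    ("10.0.0.3", 42001, 80): "endpoint-B",
}

def demux(pkt, local_ip="10.0.0.1"):
    """Return the endpoint for pkt, or None to reject it."""
    # Common section: tests shared by every filter for this protocol
    # (ip-proto, dest-ip-addr, ...), executed once per packet.
    if pkt["ip_proto"] != IPPROTO_TCP or pkt["dst_ip"] != local_ip:
        return None
    # Endpoint-specific section: analogous to ret_match_imm comparing
    # the packet's values against each filter's key data.
    return endpoints.get((pkt["src_ip"], pkt["src_port"], pkt["dst_port"]))

pkt = {"ip_proto": 6, "dst_ip": "10.0.0.1",
       "src_ip": "10.0.0.2", "src_port": 42000, "dst_port": 80}
# demux(pkt) -> "endpoint-A"
```

The point of the structure is that adding a hundredth endpoint adds one hash-table entry, not a hundredth pass over the common tests.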

The defragmentation problem is solved (note that the packet filter does not handle the defragmentation itself, only the demultiplexing to the correct endpoint) with the addition of per-packet state. That state serves as a context-independent flag indicating that a packet should be directed to a given endpoint. For example, upon arrival of the first fragment of a fragmented UDP message, dispatch information found only in that fragment's header (such as the UDP port number) can be linked to per-packet state (such as the UDP message ID). In this way, both the first and subsequent fragments can be demultiplexed immediately. The per-packet state is manipulated with these new VM instructions:

	jmp_match_imm #N, Lt, Lf	; identify a first fragment
	register_data #N, #T		; bind an endpoint to a msg id for T ms
	ret_match_data #N, #R		; return a fragment to correct endpt
	postpone #T			; postpone processing for T ms
					;   because first frag missing
	

Filters are implemented using hash tables: each transport protocol has a hash table of endpoint-specific keys, and each filter can have its own fragmentation-state hash table.
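The fragment-handling flow above can be sketched as follows. This is a hedged illustration of the idea behind jmp_match_imm / register_data / ret_match_data / postpone, not the paper's code: the field names, the endpoint names, and the use of plain dicts and a list for the postpone queue (with no timeout handling) are all assumptions.

```python
# Sketch of fragment demultiplexing with per-packet state: the first
# fragment carries the UDP header and is matched normally; the filter then
# binds the packet's message ID to the chosen endpoint so later fragments
# (which lack the UDP header) can be dispatched by message ID alone.

port_to_endpoint = {5000: "endpoint-A"}   # installed filters, keyed by UDP port
msgid_to_endpoint = {}                    # fragmentation state (register_data)
postponed = []                            # fragments arriving before the first

def demux_fragment(frag):
    """Return the endpoint for a fragment, or None if it cannot be routed yet."""
    if frag["frag_offset"] == 0:
        # First fragment: UDP header present, so dispatch by port
        # (jmp_match_imm identifies it), then record the binding
        # (register_data) for the rest of the message.
        ep = port_to_endpoint.get(frag["udp_dst_port"])
        if ep is not None:
            msgid_to_endpoint[frag["msg_id"]] = ep
        return ep
    # Later fragment: dispatch by message ID (ret_match_data) ...
    ep = msgid_to_endpoint.get(frag["msg_id"])
    if ep is None:
        postponed.append(frag)  # ... or hold it briefly (postpone)
    return ep
```

Reassembly still happens in the endpoint; the filter's only job is getting every fragment of a message to the same place without an intermediary process.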

Performance. When all filters target the same protocol, MPF latency remains constant as the number of filters increases, whereas BPF and CSPF latency grows. Because the filter program has to be analyzed in the MPF model, filter installation overhead is higher; this matters only if filters are installed and removed frequently.


Comments and Concerns