Efficient Packet Demultiplexing for Multiple Endpoints and Large Messages

M. Yuhara (Fujitsu), B. Bershad (UW), C. Maeda (CMU), J. Moss (UMass), 1994

Summary. The authors found that, in the context of a microkernel, popular packet filters were lacking in two ways: they did not scale with the number of installed filters, and they did not adequately support fragmented packets. They extended the BPF virtual machine with instructions that 1) collapse filters for identical transport protocols into a single filter and 2) allow state (such as a UDP message ID) to be associated with a filter to support demultiplexing of fragments.


More Detail

Problem: To support network subsystems implemented at user level in the Mach microkernel, the authors used BPF to demultiplex packets to the user-level network subsystem module. In that setting, the packet filter needed neither to scale (there were commonly only two network modules to which to demultiplex) nor to support fragmentation (the user-level network module handled reassembly). Unfortunately, this architecture introduced a bottleneck at the single TCP/IP protocol module, so the authors moved the TCP/IP stack into each user program that required it. This eliminated the single-network-module bottleneck but introduced a scalability problem: hundreds of applications could require network support, forcing the packet filter mechanism to process hundreds of filters. Furthermore, fragments had to be sent to a higher-level intermediary process to await the arrival of their peer fragments, which required additional address-space crossings.

The authors solved the scalability and fragmentation problems by extending the existing BPF instruction set. They recognized that most packets belong to a few transport protocols. If the virtual machine could recognize filters for the same transport protocol, it could collapse all the common filter tests into one and then incrementally perform the distinct tests. They accomplished this collapsing with the introduction of the ret_match_imm instruction:

	ret_match_imm #3, #ALL
	key #key0
	key #key1
	key #key2
	
The first argument indicates the number of data items that follow to be compared. The subsequent key pseudo-instructions provide the immediate data, which is compared with data stored in the scratch memory registers (see the BPF paper for details on the VM architecture). The resulting program has two sections: one that performs the tests common to a transport protocol (e.g., ip-proto, tcp-proto, dest-ip-addr, dest-tcp-port) and one that performs the tests specific to an endpoint (src-ip-addr, src-tcp-port, dest-tcp-port).
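The two-section structure can be sketched in a few lines of Python. This is a hypothetical illustration, not MPF's actual implementation: the packet fields, endpoint names, and table layout are invented, and the key-matching step is modeled as a dictionary lookup standing in for the keys that follow ret_match_imm.

```python
# Sketch of a collapsed TCP/IP filter: the common section runs once for
# all endpoints sharing the protocol; the endpoint-specific section is a
# single hash lookup on the distinct values, replacing N linear filters.

IPPROTO_TCP = 6

# One entry per installed filter that shares the common TCP/IP prefix.
# (src_ip, src_port, dst_port) -> endpoint; names are illustrative.
endpoints = {
    ("10.0.0.2", 42000, 80): "endpoint-A",
    ("10.0.0.3", 42001, 80): "endpoint-B",
}

def demux(pkt, local_ip="10.0.0.1"):
    """Return the endpoint for pkt, or None to reject it."""
    # Common section: tests shared by every filter for this protocol
    # (ip-proto, dest-ip-addr, ...), executed once per packet.
    if pkt["ip_proto"] != IPPROTO_TCP or pkt["dst_ip"] != local_ip:
        return None
    # Endpoint-specific section: analogous to ret_match_imm comparing
    # the packet's values against each filter's key data.
    return endpoints.get((pkt["src_ip"], pkt["src_port"], pkt["dst_port"]))

pkt = {"ip_proto": 6, "dst_ip": "10.0.0.1",
       "src_ip": "10.0.0.2", "src_port": 42000, "dst_port": 80}
# demux(pkt) -> "endpoint-A"
```

The point of the structure is that adding a hundredth endpoint adds one hash-table entry, not a hundredth pass over the common tests.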

The defragmentation problem is solved (note that the packet filter does not handle the defragmentation itself, only the demultiplexing to the correct endpoint) with the addition of per-packet state. That state serves as a context-independent flag indicating that a packet should be directed to a given endpoint. For example, upon arrival of the first fragment of a fragmented UDP message, dispatch information found only in that fragment's header (such as the UDP port number) can be linked to per-packet state (such as the UDP message ID). In this way, both the first and subsequent fragments can be demultiplexed immediately. The per-packet state is manipulated with these new VM instructions:

	jmp_match_imm #N, Lt, Lf	; identify a first fragment
	register_data #N, #T		; bind an endpoint to a msg id for T ms
	ret_match_data #N, #R		; return a fragment to correct endpt
	postpone #T			; postpone processing for T ms
					;   because first frag missing
	

Filters are implemented using hash tables: each transport protocol has a hash table of endpoint-specific keys, and each filter can have its own fragmentation-state hash table.
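The fragment-handling flow above can be sketched as follows. This is a hedged illustration of the idea behind jmp_match_imm / register_data / ret_match_data / postpone, not the paper's code: the field names, the endpoint names, and the use of plain dicts and a list for the postpone queue (with no timeout handling) are all assumptions.

```python
# Sketch of fragment demultiplexing with per-packet state: the first
# fragment carries the UDP header and is matched normally; the filter then
# binds the packet's message ID to the chosen endpoint so later fragments
# (which lack the UDP header) can be dispatched by message ID alone.

port_to_endpoint = {5000: "endpoint-A"}   # installed filters, keyed by UDP port
msgid_to_endpoint = {}                    # fragmentation state (register_data)
postponed = []                            # fragments arriving before the first

def demux_fragment(frag):
    """Return the endpoint for a fragment, or None if it cannot be routed yet."""
    if frag["frag_offset"] == 0:
        # First fragment: UDP header present, so dispatch by port
        # (jmp_match_imm identifies it), then record the binding
        # (register_data) for the rest of the message.
        ep = port_to_endpoint.get(frag["udp_dst_port"])
        if ep is not None:
            msgid_to_endpoint[frag["msg_id"]] = ep
        return ep
    # Later fragment: dispatch by message ID (ret_match_data) ...
    ep = msgid_to_endpoint.get(frag["msg_id"])
    if ep is None:
        postponed.append(frag)  # ... or hold it briefly (postpone)
    return ep
```

Reassembly still happens in the endpoint; the filter's only job is getting every fragment of a message to the same place without an intermediary process.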

Performance. When all filters target the same protocol, MPF latency remains constant as the number of filters increases, whereas BPF and CSPF latency grows. Because the filter program has to be analyzed in the MPF model, filter installation overhead is higher; this matters only if filters are installed and removed frequently.


Comments and Concerns