Scalable Data Naming for Application Level Framing in Reliable Multicast

Suchitra Raman and Steven McCanne (UCB), 1998

Summary. Current transport level protocols name data with sequential numbers, a scheme that does not permit receivers to "flexibly tailor their reliability semantics" leading to more efficient use of the underlying network. The authors propose a naming scheme that exposes the structure of application data to the transport layer, enhancing the applications's control over the reliability.

More Detail

Announce-listen. In this approach to RM, a source periodically multicasts the data to some agreed upon address and interested parties listen in. Eventually, all parties will have a consistent copy of the data. Although simple and robust, announce-listen is suboptimal for two reasons: does not scale well in terms of bandwidth and the time-to-convergence may be too great.

"SRM adds a level of indirection to the announce/listen framework. In SRM, a source enlists the announce/listen framework to disseminate 'meta-data' that summarizes al the available data without actually sending it...As an optimization, a source might multicast data once upon creation to avoid delay."

ALF. To perform receiver-reliability on traditional protocols, need:

rich naming scheme -> not mapping to linear namespace
scalability

SNAP solves (1) with two-dimensional naming scheme: it uses announce/listen to disseminate page structure that provides a common-syntax that enables receiver reliability. SNAP addresses (2) by exploiting hierarchy to gain a level-of-indirection by providing a summary fingerprint.

SNAP Overview.

Namespace: Hierarchical, per-source structure onto which an application maps data.
ADU: Smallest unit of data that can be processed independently; . ADUs named within a container sequentially starting from zero. Bytes within an ADU are named sequentially starting from zero. ADU's are named with the pair (CID, ADU seqno).
Fragments: Fragments within an ADU are numbered using byte boundaries. To signify that more ADU fragments are expected, have a "more" bit set to 1. A "more" bit of 0 indicates the last fragment in an ADU. Since the ADU is the smallest unit of data that an app sees, only the transport layer (SRM) sees fragments.
Container: Nodes in the namespace hierarchy; containers can contain ADUs, other containers, or both; containers may contain just one ADU. Containers named using CID's assigned sequentially starting at a randomly chosen Initial CID (ICID). Containers are the unit of reliability.
Namemap Container: A special container that resides at the left-most position of the namespace tree. Permits re-use of "protocol machinery".

Signature: Representation of a namespace;

		s(C) = right_edge, if C is a leaf-level container;
		       h(right_edge, s(C1),...s(Ck)), otherwise

		where h() is the MD-5 hash function, k is the
		number of children that C contains, and right_edge
		is the pair (max ADU Seqno, last byte offset in last ADU) 
		(latter in case ADU not done being transmitted).

Sources multicast their namespace signatures in periodic Session Announcements.

Session announcments: s(namespace root) plus some number of container signatures. Which containers are sent out is chosen heuristically based on the amount of time passed since their signature was last sent out.
Don't care containers: Since apps may practice selective reliability, they may not care about certain repairing certain containers. For these containers, the most current signature is stored to prevent spurious signature mismatches higher up in the tree.

Fragmentation. Relying on IP to do link-level frag and reassembly can be inefficient since the loss of one fragment could result in the retransmission (multicast) of an entire packet. To avoid fragmentation, SRMv2 uses 1024 byte MTU.

Source IDs. The Source ID needs to meet the following three requirements:

Uniqueness. RTP uses random 32-bit words; okay for non-reliable protocol, but not reliable protocol.
Time invariance. Unlike RTP apps where "app state is ephemeral", rm apps require that a source have the same state (id) throughout the entire session - possibly even through leave and joins / crashes and recoveries.
Terminal independence. A source could move from machine to machine; can't use IP address/port pair.

SNAP uses MD-5 hashed, application-supplied arbitrary strings.

Loss Recovery \ Namespace Discovery. Non-tail losses are detected via gaps in sequence numbers (fragments), CID's (containers), ADU seqnos (ADUs). Tail losses are discovered via mismatches between signatures sent out in session announcments.

Should losses in anything other than fragments be detected (fragment loss can be detected but not repaired because the app supplies repair data and it does not deal in fragments - only ADUs), SRMv2 informs the application. It is then the applications responsibility to determine if a repair is desired and, if so, trigger the repair.

Assuming the app desires repairs, if a signature mismatch is detected, a namespace repair request is sent. In response, a list of the children and their signatures is sent. The source can then traverse the response and send repair requests for specific ADU's. If a whole container is lost (caused by a tail loss in the name map), the container name in the map and the ADU's are repaired. (?)

A namespace repair request is expected not to be the common case; the common case is that the container signature will be sent in the session announcent (not sure what the reasoning is here).

Performance. Three components to delay for tail losses:

State update (can't detect until
SRM damping delay
Repair response time

Showed constant (non-increasing) number of duplicates for requests and replies. Showed linear convergence time as a function of the state update period.

API. API presented.

Questions/Comments.

Why is SNAP restricted to reliable multicast? Oh - in intro, do mention that ALF replies to unicast as well as multicast.
In section 2, page 3, paragraph 2, it is explained that upon receipt of the fragment that has the "more" bit set to 0, srm checks if the ADU is complete. If so, passes to application. This check should be done after every fragment is received in case of out-of-order receipt, shouldn't it?
Why a more bit to detect fragment loss? Out-of-order f arrivals can make this complicated. Why not just use an ADU size? Aren't these known at the outset based on a bytecount passed to srm_start()?
What about ATM networks? Perhaps these aren't common enough - they're usually leaf networks.
Container vs. Node terminology.
Say that the probability of source name collision is the same as in RTP, but the SNAP name is 64-bit and RTP is 32-bit. Why is the probability the same?
What happens if I decide I care about a Don't Care container? Does the app notify SRM that it would like to send a request?
Performance only went to 55 nodes due to ns constraints. Would be interesting what happens with more nodes.
How do you do resource control acctg on SRM repairs?

Other:

leaky bucket
token bucket
ADU as defined in end-to-end paper