Sai Susarla
In this thesis, we propose to demonstrate the feasibility of implementing such a flexible consistency management system in the context of a wide-area data storage infrastructure, and its value in catering to a variety of application needs. For this, we will build an experimental prototype file storage system that provides coherent file access at page granularity across a WAN. We will implement the following consistency options on a per-file basis: strict consistency (pessimistic, lease-based), and optimistic append-only and last-writer-wins consistency. We will also give applications control over various aspects of the consistency implementation, such as the frequency and direction of update propagation and the level of tolerance to staleness of data.
We demonstrate the value of our consistency management framework by building four representative services from a variety of distinct classes. We show how the set of consistency options and the control parameters that we support are sufficient to provide the appropriate level of data coherence in these applications efficiently. The four services are: 1) a simple distributed file store that supports a variety of file-sharing patterns across the wide-area; 2) a directory implemented as a page-based hash table within a file, to illustrate strictly consistent access; 3) a chat room service implemented as concurrent appends to a single file, illustrative of distributed data/event logging and many-to-many data-flow; 4) a scoreboard application that broadcasts the status of an ongoing game updated at a site to registered listeners across the Internet, to illustrate one-to-many data dissemination with the ability to tune the amount of update traffic generated based on network conditions and how much staleness can be tolerated by listeners.
In this proposal, we consider the problem of maintaining data consistency in wide-area data-intensive applications. Many Internet services are data-intensive. They involve the access and delivery of shared data among distributed components, e.g., clients and servers. Some example services include shared file services, the WWW, email, Usenet, directory services, data logging, and streaming and publish-subscribe applications. Distributed services typically cache/replicate data close to where it is accessed to improve availability and access latency. Caching is especially beneficial across the wide-area to hide the effects of the variable communication latencies and bandwidths and intermittent connectivity typical of the wide-area environment.
However, caching introduces the problem of maintaining consistency among data replicas. A key issue in providing consistent access to replicated data is the tension between the degree of replica coherence and the communication costs required to maintain it. This problem of consistency management is complicated in the wide-area environment by non-uniform network characteristics between nodes, the potentially large numbers of clients present, and the infeasibility of obtaining global state for making local decisions. Several consistency schemes have been developed to address this tension in the context of specific wide-area services to suit specific data access patterns [13,20,37,16,28,32]. We observe (in Chapter 3) that many of these services employ different combinations of a common set of mechanisms to handle various aspects of consistency management, such as concurrency control and update propagation. However, each existing consistency solution inseparably bundles together the individual mechanisms that suit the service requirements for efficiency. Although this approach provides good performance for that service, it hinders the reuse of these consistency mechanisms in a different combination to accommodate different service requirements. As a result, wide-area distributed services involving replicated data management are hard to develop and deploy, because the consistency solution for each service is essentially developed and tuned from scratch.
Managing the consistency of cached/replicated data requires significant development effort, despite the emergence of many useful mechanisms in the context of individual applications. This is because many existing services integrate the collection of consistency mechanisms they need in a monolithic design. As a result, individual mechanisms cannot be isolated easily from various existing services and reused together in a new service, without significant engineering effort. If the consistency needs of the new service are not almost identical to those of some existing service, it becomes very difficult to reuse existing algorithms or code and achieve good performance.
In this thesis, we address the problem of developing a flexible and reusable consistency management framework that relieves some of the development burden for a variety of Internet services. For the purpose of this thesis, we focus on supporting the data coherence needs of applications involving (i) read/write access to shared data/state, (ii) data dissemination, and/or (iii) data exchange among distributed components over the wide-area.
Existing consistency mechanisms vary with respect to when they allow access to a replica, when and how they propagate updates to synchronize replicas, and how they detect and resolve update conflicts (if they arise). The right set of policies to employ depends heavily on the individual application's data access pattern. Hence, application control over these policies is essential for the efficiency and scalability of the resulting system. We believe that many application-level consistency requirements can be translated into a few core system policy choices that can be modified to tailor the consistency implementation for each individual application. Some choices that we identified are: push- vs. pull-based update transfer, multiple update transfer conditions, and a variety of consistency guarantees. Hence, letting applications directly make these choices helps achieve efficient consistency management.
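These per-application policy choices can be pictured as a small options record attached to each data item. The sketch below is illustrative only: the names (`ConsistencyOptions`, `Guarantee`, `Transfer`, and the staleness and timer fields) are assumptions for exposition, not our system's actual interface.

```python
from dataclasses import dataclass
from enum import Enum

class Guarantee(Enum):
    STRICT = "strict"              # pessimistic, lease-based
    APPEND_ONLY = "append-only"    # optimistic, append-only updates
    LAST_WRITER_WINS = "lww"       # optimistic, last writer wins

class Transfer(Enum):
    PUSH = "push"   # a replica with a pending update delivers it to others
    PULL = "pull"   # each replica polls others for updates

@dataclass
class ConsistencyOptions:
    guarantee: Guarantee = Guarantee.LAST_WRITER_WINS
    transfer: Transfer = Transfer.PULL
    max_staleness_secs: float = 30.0  # tolerated staleness of cached data
    push_period_secs: float = 5.0     # period for timer-driven propagation

# A chat-room log might choose append-only updates pushed eagerly:
chat_opts = ConsistencyOptions(guarantee=Guarantee.APPEND_ONLY,
                               transfer=Transfer.PUSH,
                               max_staleness_secs=0.0,
                               push_period_secs=0.0)
```

Letting each application fill in such a record, rather than hard-coding one combination, is the essence of the flexibility we propose.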
We hypothesize that:
A consistency management system that provides a small set of customizable
coherence mechanisms can efficiently satisfy the data coherence needs of a wide
variety of distributed applications.
To support this hypothesis, we propose to develop a system that can efficiently provide consistent shared data access and data delivery to a variety of useful wide-area applications by exporting a common set of mechanisms needed by them. Its mechanisms are reusable and their behavior can be customized to application-specific needs for efficiency. Using our system will relieve application writers of much of the burden of implementing wide-area consistency management, and will enable them to focus on other issues.
In this thesis, we only address the issue of maintaining consistency among a given network of data replicas across the wide-area, while handling variable latencies and bandwidths among the network links. There are many other important issues in wide-area data access that are out of the scope of this thesis, including:
A replica network can be leveraged not only to minimize the raw communication costs between nodes, but also to limit the consistency management state for scalability. For this thesis, we limit our focus to techniques to leverage a given replica hierarchy for scalable consistency management. However, for our evaluation prototype, we employ a multi-level caching scheme that works well for a variety of applications. This scheme creates and dissolves cached copies based on usage and links them by a dynamic replica hierarchy to reduce the cost of consistency-related communication by taking into account the network quality between copy sites. Although our scheme may not always create an optimal replica topology, the design of our framework allows it to be easily replaced by better schemes (such as those of Bayeux [41] and Scribe [31]) without reducing the effectiveness of our framework.
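As a rough illustration of how such a hierarchy can take network quality into account, the sketch below attaches a new replica under the existing copy with the lowest estimated round-trip latency. The `choose_parent` helper, the site names, and the latency figures are all hypothetical; a real system would derive latencies from network measurements.

```python
def choose_parent(new_site, replicas, rtt_ms):
    """Pick the existing replica closest to new_site, so that
    consistency-related traffic flows over cheap links.
    `rtt_ms` maps (site_a, site_b) -> estimated round-trip time in ms.
    """
    return min(replicas, key=lambda r: rtt_ms[(new_site, r)])

# Toy latency table (ms) between a new site and existing replicas;
# the names and numbers are purely illustrative.
rtt = {("utah", "root"): 80,
       ("utah", "eu-mirror"): 140,
       ("utah", "west-cache"): 20}

parent = choose_parent("utah", ["root", "eu-mirror", "west-cache"], rtt)
# The new replica at "utah" then synchronizes through "west-cache"
# rather than contacting the distant root directly.
```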
Replication for fault tolerance is orthogonal to the problems addressed by our thesis. Our focus is on keeping a set of secondary replicas (i.e., transient copies) coherent with each other and a primary replica (i.e., a permanent copy) of a data item in the face of updates. In contrast, replication for fault tolerance typically involves maintaining multiple primary replicas in sync with each other to protect against permanent loss of some of them. We believe that our system can coexist with many fault-tolerance solutions such as clustered services [14] and other quorum/voting replication schemes [19,27]. Their protocols only operate among primary replicas. They have little interaction with secondary replicas and hence little impact on the consistency management issues addressed by this thesis.
Lastly, protection of data against unauthorized access is especially important across the wide-area, and a heterogeneous caching solution is not very useful without an effective security solution. However, the consistency management issues addressed by this thesis are largely orthogonal to security. We believe that the consistency management solution that we propose can be easily modified to incorporate a security solution, but we leave the exploration of their interaction for future research.
Our approach to achieving scalability and performance in the wide-area is to adopt a multi-level caching network similar to that proposed by Blaze [1], but to make use of it in novel ways. We use this network for supporting multiple coherence policies. Our implementation of these policies takes the non-uniform communication costs between nodes into account for consistency-related communication. We provide multiple policy-independent update propagation mechanisms that leverage the replica hierarchy for scalability. We delegate workload among replicas based on network proximity. We resolve update conflicts at each level of the replica hierarchy to limit their spread. Each replica only needs to maintain consistency management state for its immediate neighbors in the replica hierarchy. This helps our system scale to a large number of replicas and clients. These design issues are unique to the wide area and do not arise in a LAN environment due to its uniform and relatively low communication costs.
Previous systems provide either scalable wide-area support or flexible consistency in the form of multiple consistency policies, but not both. Our approach is unique in that, in addition to supporting multiple policies, it also gives applications choice over a variety of mechanisms for enforcing these policies over the wide-area. This choice enables our system to support a wider variety of applications efficiently. To identify the aspects in which application control over consistency mechanisms is beneficial, we examine the problem of consistency management closely in Chapter 3. We identified various techniques based on issues such as when access is allowed to a replica (e.g., optimistic vs. pessimistic access control), where updates can be issued (e.g., single-master vs. multi-master), the direction in which updates are transferred (e.g., pull-based vs. push-based updates), and the conditions that can trigger update transfer (e.g., timer-driven, staleness-driven, eager or lazy transfer). These issues are common to many consistency policies. However, existing systems bundle these decisions inseparably into a policy's implementation.
The success of our thesis lies in demonstrating that scalable flexible consistency management over the wide-area is both viable and useful. We propose to demonstrate its viability by showing that it can be implemented in a manner that is both scalable and resilient to node/network failures and intermittent connectivity, which is essential in a wide-area environment. The availability of an application's data for access in partition situations is determined by the consistency guarantees desired, and is thus application-dependent. We propose to show that our design can scale reasonably well with workload along several dimensions: 1) number of clients, 2) number of servers, 3) number of objects, 4) number of replicas per object, 5) object access rate (operations per sec), and 6) variability of network quality (latency and bandwidth).
We also propose to demonstrate that our approach has the following useful benefits over existing approaches to consistency management. First, it enables the reuse of consistency management techniques across a variety of wide-area applications. Second, support for multiple coexisting consistency policies lets an application meet the specific consistency, availability and intermittent connectivity handling needs of its individual data items. This support also lets the application-writer evaluate the suitability of various options for a given data item without significant implementation effort. Third, the flexibility offered by our approach in the choice of implementation techniques helps an application to tune its consistency to achieve good performance and scalability. Lastly, application control over update propagation strategy on a per-replica basis helps meet the diverse data quality requirements of different clients for a given data item within an application.
To validate these claims, we propose to implement our consistency management system in the context of a wide-area data store that we are developing called Khazana, which we describe in detail in Chapter 4. Khazana is organized as a collection of peer data servers that cooperate to provide coherent access to flat files across the wide-area. It provides user choice over a variety of consistency policies and implementation techniques on a per-file basis, which we describe in Chapter 3. We then propose to develop and evaluate four applications with distinct data access patterns and consistency requirements using Khazana that are representative of the three classes of applications mentioned in Section 1.1.
We plan to conduct a series of measurement experiments with these applications to validate each of our claims. The first experiment demonstrates how our system scales to a large number of files with different consistency choices coexisting in a single wide-area application. The second experiment shows how the customizability of mechanisms for implementing a given consistency policy can improve application performance. The third experiment evaluates the ability of our system to deal with intermittent connectivity. It demonstrates how a variety of consistency and availability tradeoffs can be achieved using our framework. The fourth experiment demonstrates that our system efficiently supports distributed data structures with stringent consistency requirements, such as those used in a directory service. The fifth experiment tests the latency and efficiency of real-time data dissemination in Khazana and shows that our asynchronous update-propagation approach helps achieve scalable real-time data exchange. The last experiment evaluates how the ability to customize update propagation strategy enables us to provide a variety of data quality levels at individual replica sites. We describe these experiments in detail in Chapter 5.
Lastly, we plan to evaluate the performance and scalability of two real-world applications built on our framework, namely a chat room service and a directory application, relative to popular existing implementations. We illustrate how the versions based on our framework are simplified, yet still perform reasonably when compared to the existing versions of these applications.
In this thesis, we propose to show the feasibility and value of implementing a wide-area consistency management system that is flexible enough to satisfy the data access and delivery needs of a variety of wide-area applications. We target applications in the three classes, namely those involving 1) read/write access to shared data/state, 2) data dissemination and 3) data exchange among distributed components over the wide-area. Our approach to achieving scalability, flexibility and performance is to provide a set of useful consistency guarantees for our targeted application classes while giving applications significant control over their implementation.
Our intended contributions are:
Much research effort has gone into building distributed services ranging from cluster-based services to wide-area services. Many of these services have addressed the important problem of maintaining data consistency in the face of distributed data access. Distributed application developers have traditionally employed some combination of two distinct programming paradigms, namely message-passing and shared data. Applications using the message-passing paradigm explicitly partition shared state among components and perform all data management themselves via explicit communication, utilizing only the system's support for communication. This approach results in a more efficient implementation, but leaves the burden of data management on the application programmer. In contrast, shared state-based applications assume an underlying shared data management system and layer application logic on top, while leaving data replication and consistency issues to the system. This approach relieves programmer burden by handling all the hard issues of data management behind a familiar programming interface. However, it often suffers from poor performance due to: 1) mismatch between the techniques employed by the system and the application's needs, and 2) lack of a flexible interface for communication of these needs between the application and the system for better adaptation. To avoid this inefficiency, most distributed service developers still build custom data management solutions on top of a message-passing infrastructure, and hardwire service-specific knowledge into the data management logic to achieve good performance. For the purpose of this thesis, we examine a specific aspect of this problem, namely consistency management of shared data, and see how much support is available for distributed applications in this regard.
Previous systems in this area can be broadly classified into three categories: systems that provide flexible consistency management, systems that invent novel consistency guarantees to suit specific data access patterns, and systems that explore efficient techniques to implement specific data coherence schemes over the wide-area in the context of individual applications. In this Chapter, we examine existing systems in these three categories closely to give a perspective for our research. Our work builds upon the insights gained from observing these systems and the applications using them.
Flexibility in consistency management has often taken the form of supporting multiple selectable consistency options for application data. Munin [3] provides flexible/tunable consistency policies in a LAN environment for distributed shared memory-based parallel programming. WebFS [36] provides a useful set of coherence policies for wide-area file access and update multicasts. Fluid replication [5] provides intelligent data placement and multiple consistency policies for mobile file access across the wide-area. TACT [39] provides optimistic consistency with application-specified notion of coherence in a wide-area environment. The flexibility in consistency options provided by these systems enables their reuse for a variety of applications. However, these systems are limited in the hooks they provide to let an application customize the implementation of their policies. As we will demonstrate in our evaluation, the customizability of the underlying implementation (e.g., pull- vs. push-based update propagation) is important for the performance of some applications. Moreover, these systems do not adequately address the issues of dealing with variable network latencies and/or scalability (to a large set of clients) typical of the wide-area in consistency-related communication. In our work, we propose to build on their success by providing similar flexibility while dealing with some of the wide-area issues listed above.
Providing data coherence across the wide-area while dealing with variable network latencies, disconnections and partitions is not new. However, previous work has been done in the context of specific data access patterns. AFS provides last-writer consistency in a wide-area file system with traditional Unix-style file access. The Coda file system [20] provides support for weakly connected and disconnected operation in a campus environment employing a client-server architecture. Ficus [29] provides replicated file access while handling intermittent connectivity in a more peer-to-peer setting. Bayou [7] provides optimistic consistency for a replicated database in a peer-to-peer mobile environment with highly intermittent connectivity and unstructured communication patterns. It requires application-specific conflict resolution procedures to be supplied. As it relies on ad hoc communication patterns among mobile replicas for update propagation, replicas can take unbounded time to synchronize. In contrast, we propose to employ a hierarchical replica topology to achieve more bounded synchronization. Blaze's PhD thesis [1] showed the value of constructing dynamic cache hierarchies on a per-file basis among clients to support efficient large-scale file sharing and to reduce server load in distributed file systems. These systems have developed many effective techniques to achieve performance and scalability in a wide-area environment. We propose to leverage some of these techniques in a more general-purpose consistency management framework for the benefit of a wider variety of applications.
Numerous consistency options have been developed individually to handle the data coherence needs of specific services such as file systems, directory services [24], databases [26] and persistent object systems [22,9,38,15]. Many algorithms have been proposed to maintain consistency among data replicas for fault-tolerance in clustered as well as wide-area environments in the face of network partitions (e.g., quorum consensus [12], Deno [19]). The goal of our work is to design a framework in which these and other algorithms can be reused by a variety of applications. Saito presents a comprehensive survey [32] of optimistic consistency schemes with a taxonomy for classifying them based on their implementation choices. We have extended his taxonomy to include pessimistic algorithms also, and used it to arrive at a set of tuning parameters to provide flexible consistency management, which we describe in Chapter 3.
Persistent shared object systems such as Thor [22], PerDis [9], Globe [38] and Legion [15] provide object caching support with serializability for transaction semantics. The OceanStore [21] project aims at providing a secure, global scale persistent data utility. Objects in OceanStore are immutable and consistency is maintained based on versioning instead of in-place updates. They employ update procedures to apply and propagate updates among replicas, similar to Bayou.
Recent popularity of Internet-based file sharing (e.g., Napster [18]) has prompted research into peer-to-peer organization of distributed applications, in which any component can dynamically take on any role (e.g., client or server) in the system. Much of this work has focused mainly on anonymous information dissemination (e.g., Freenet) and/or immutable deep archival storage among peers, as opposed to handling concurrent updates to replicated data. Pastry [30], Chord [35] and Tapestry [40] provide efficient object location and request routing across the wide-area based on hashed object IDs. PAST [30] and Farsite [2] provide cooperative file storage and retrieval, but do not address consistency among multiple replicas in the face of updates. Our consistency management work is complementary to these systems, as it can add consistent update functionality to them, making their benefits applicable to a wider range of services.
The application classes that we target have been explored individually by different systems. In particular, applications based on wide-area data dissemination and exchange currently employ message-passing using some form of multicasting support such as MBone [23] or IP multicast [6]. Our approach of achieving data multicast as a side-effect of update propagation among data replicas, though not as efficient as these highly tuned solutions, has several benefits. It allows intermediate data staging along the multicast path to cache data closer to its clients. It enables the quality of multicast data and the frequency of update transfer to be customized at intermediate hops to the consistency level needed by individual replicas, unlike IP multicast, where data quality is controlled only at the sender, not at intermediate hops. Scribe [31] and Bayeux [41] construct multicast trees for efficient data/event dissemination among data replicas, by virtue of being built on top of topology-aware data location and routing schemes such as Pastry and Tapestry (respectively). Our focus is on making use of a multicast tree efficiently, once it is built, by employing data staging along the tree nodes.
In summary, our consistency management framework enables efficient reuse of a variety of consistency management techniques and options developed by prior research, by providing them to applications in a wide-area environment with significant control over their implementation.
In this Chapter, we examine the issues in providing data consistency and availability in a general-purpose manner across the wide-area.
A common technique to increase data availability and reduce access latency in a distributed service is to cache/replicate data on nodes close to where it is accessed. However, concurrent updates to these copies require a mechanism to keep them coherent. Fox et al. [10] postulate that there is a fundamental tension between the degree of data consistency (C) and availability (A) achievable in an application and its ability to handle network Partitions (P). They called it the CAP principle. According to this principle, we can get at most two out of these three properties for a given piece of data. P is important for wide-area applications. Hence, if we need high availability in a wide-area application, we have to relax its consistency requirement appropriately by choosing a policy with weaker guarantees for its data. We cannot achieve these properties independently. Regarding disconnections and partitions, we again appeal to the CAP principle: it is up to the semantics of a given consistency policy to decide what levels of C and A can be provided in a partition situation. For instance, if an application needs strict consistency among copies across the wide-area, it must employ some voting/quorum algorithm such as Deno [19] that restricts data availability in a partition situation. Fortunately, not all Internet services need such strict guarantees; many of them can live with weaker guarantees, thus allowing a wide spectrum of choice in the degrees of C, A and P in practice.
In this thesis, our focus is on providing a wide-area framework for supporting a variety of consistency policies, each of which provides a specific combination of degrees of consistency, availability and behavior in intermittent connectivity situations. In particular, we propose to explore the issues in actually implementing multiple policies in the wide-area, exploiting any common mechanisms they can take advantage of, and providing application control for effectively dealing with variations in network quality.
A consistency policy is a contract between the application and the system regarding data access. If the application adheres to system-imposed conventions for data access, the system guarantees to maintain certain invariants about the quality of data accessed. A consistency mechanism is a set of actions taken by the system's components to enforce this contract. These actions include the following: 1) controlling admission of access and update operations on data replicas, 2) propagation of updates among replicas to maintain desired invariants about data quality, and 3) handling conflicting updates (if any).
Consistency mechanisms vary with respect to when they allow access to a replica, when and how they propagate updates to synchronize replicas, and how they detect and resolve update conflicts (if they arise). The right policy for each of these design issues depends heavily on the application's data access patterns. Hence, application control over these issues is essential for the efficiency and scalability of the resulting system. In this Section, we classify consistency mechanisms along a few aspects listed below. For this purpose, we extend the taxonomy described by Saito [32] by several dimensions.
This determines whether an access/update request is satisfied after or before bringing the replica to the desired consistency level. Pessimistic schemes propagate updates (either lazily or eagerly) to achieve the desired level of coherence before allowing access. Optimistic schemes allow immediate access to any data replica, but perform update propagation after data access, in the hope that conflicting updates are rare and/or can be easily resolved. With optimistic schemes there is a risk of stale reads and conflicting writes, as updates may not propagate in time to prevent conflicting access on other nodes. The need for conflict detection makes them more complicated to implement than pessimistic schemes.
Pessimistic consistency schemes are attractive when updates are relatively frequent but cannot handle disconnections, whereas optimistic schemes provide high availability in the face of disconnections but perform poorly when update conflicts are frequent. One possible strategy (adopted by Coda [20]) is to employ pessimistic consistency during periods of good connectivity, while switching to an optimistic scheme at other times.
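The contrast between the two admission policies can be sketched as a single decision function. The name `admit_write` and its arguments are illustrative, and the lease-granting protocol itself is elided; this is a sketch of the tradeoff, not an actual implementation.

```python
def admit_write(policy, holds_lease, connected):
    """Decide whether a write may proceed at this replica.

    A pessimistic policy synchronizes first: the write is admitted only
    if the replica is connected and holds the write lease, so it blocks
    during disconnections.  An optimistic policy admits the write
    immediately and reconciles any conflicts afterwards.
    """
    if policy == "pessimistic":
        return connected and holds_lease
    if policy == "optimistic":
        return True
    raise ValueError("unknown policy: " + policy)

# At a disconnected replica, optimistic writes proceed while
# pessimistic writes block:
can_write_opt = admit_write("optimistic", holds_lease=False, connected=False)
can_write_pes = admit_write("pessimistic", holds_lease=True, connected=False)
```

A Coda-style adaptive scheme would simply switch the `policy` argument from "pessimistic" to "optimistic" when connectivity degrades.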
This determines where an update can be issued. Single-master systems statically designate a master replica where all updates must be made before being propagated to the others (e.g., DNS [25], active directory schema [24], web content). These are simple to implement, but the master is a single point of failure. They are suitable when consistency is more important than availability. Multi-master systems allow updates to be issued at any replica, and the replicas coordinate to exchange each other's updates (e.g., active directory data [24], AFS [17], Bayou [7]). They may involve coordination of multiple replicas, are more complex, and may introduce update conflicts in an optimistic setting. Hence they are more suitable when data availability is more desirable than consistency.
Pull-based schemes make each replica responsible for polling others for updates (e.g., Coda, WWW), whereas push-based schemes make a replica with a pending update responsible for delivering it to other replicas (e.g., Usenet [34], Porcupine [33], Active Directory [24]). Pull-based schemes are suitable for read/write data sharing environments involving large-scale but infrequent read-sharing. Push-based schemes are more suited to frequently read data and to data distribution.
This determines the condition under which update propagation is initiated. It may be user-initiated (e.g., when user presses ``refresh'' button), timer-driven (e.g., every 3 hrs in active directory), topology-based (i.e., frequency decided based on communication costs and connectivity situation), data-driven (i.e., based on tolerated staleness of data as in TACT [39]; this is application-specific), eager (i.e., upon update arrival), or lazy (e.g., when an access request arrives).
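These trigger conditions can be sketched as a single predicate evaluated at a replica; `should_propagate` and its parameters are hypothetical names for illustration, and the topology-based and user-initiated cases are omitted for brevity.

```python
def should_propagate(trigger, *, elapsed=0.0, period=0.0,
                     staleness=0.0, max_staleness=0.0,
                     pending_update=False, access_request=False):
    """Evaluate one update-transfer trigger condition (a sketch)."""
    if trigger == "timer":
        return elapsed >= period          # e.g., every few hours
    if trigger == "staleness":
        return staleness > max_staleness  # data-driven, TACT-style bound
    if trigger == "eager":
        return pending_update             # propagate upon update arrival
    if trigger == "lazy":
        return access_request             # propagate only when asked
    raise ValueError("unknown trigger: " + trigger)

# A staleness-driven replica tolerating 30 seconds of staleness
# propagates once its copy is 45 seconds out of date:
overdue = should_propagate("staleness", staleness=45.0, max_staleness=30.0)
```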
All the dimensions in the above classification are largely orthogonal to each other, and most existing consistency mechanisms adopt a combination of options in each of these dimensions. Hence the above dimensions naturally serve as effective consistency control parameters. By allowing applications to specify various options along these dimensions, the system can tune its implementation and achieve good application performance.
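Taken together, one choice along each dimension yields a control record per data item. The sketch below is an assumed encoding for exposition, not our system's actual interface; the field names and string values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ConsistencyControls:
    access: str    # "pessimistic" or "optimistic"
    master: str    # "single" or "multi"
    transfer: str  # "push" or "pull"
    trigger: str   # "user", "timer", "topology", "staleness",
                   # "eager", or "lazy"

# AFS-style wide-area file sharing might be expressed as one point
# in this four-dimensional space:
afs_like = ConsistencyControls(access="optimistic", master="multi",
                               transfer="pull", trigger="lazy")
```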
In this Section, we examine the consistency management needs of the three application classes mentioned in Section 1.1 that we target for our thesis. Examples of the kinds of shared data applications that we are targeting include distributed file service, directory services, the WWW, and distributed simulations. Such applications typically have significant locality to exploit and hence can benefit from caching. However, their data access patterns (e.g., relative frequency and spread of read and write operations) and availability requirements vary widely. For instance, system binary files and static web content are infrequently but widely read, rarely updated, and can trade off consistency for availability. Hence single-master optimistic consistency with pull-based update propagation works well for these files. System configuration files have similar properties, but with one difference: they are frequently read. A pull-based update propagation approach overwhelms the master with frequent pull requests from many clients. Hence for these files, a master-initiated push through the replica topology performs and scales better than the pull-based approach. Similarly, the availability needs of data in a partition situation vary from one type of data to another. Directory schema updates should be disallowed in all but one partition, whereas read-only access is more desirable for shared project files. These variations can be enforced with optimistic vs. pessimistic consistency schemes.
Data dissemination involves one-to-many communication of data, such as is commonly performed as part of multimedia streaming, stock ticker distribution and distributed sensor logging. Data exchange involves many-to-many communication among nodes, such as in the case of Internet-based chat rooms, Usenet newsgroups, and Internet games. Both of these application classes need an efficient data dissemination mechanism from producers to registered consumers and hence can benefit from system support for asynchronous update propagation based on multicasting. An intermediate network of data caches can help absorb the effects of transient disconnections and variable network latencies. A single-master approach suits the former class, while a multi-master approach provides better update latencies for the latter class. Some of these applications, such as remote status monitoring, need control over the quality of reported data on a per-client basis to suit individual clients' bandwidth availability. For this they need control over update propagation strategies (such as timer-driven and staleness-driven propagation).
In summary, the diverse consistency requirements of these classes can be supported well by the combination of a few consistency policies with control over their implementation along the dimensions that we identified above.
In this Chapter, we describe the design of a prototype wide-area data store called Khazana that we are developing as a test-bed to evaluate our thesis. We first give an overview of Khazana and move on to describe its consistency management subsystem in detail, which is the focus of our thesis.
Khazana exports flat files called regions and provides session-oriented read/write access to these files at page-granularity. Khazana is organized as a collection of cooperating peer data server processes called Kservers. Each Kserver utilizes a portion of its persistent local store for permanent copies of some files and the rest as a cache of remotely stored files. Administrators can run Kservers on many machines across the wide-area. Kservers discover each other through some external means such as a directory service and/or as a side-effect of file lookups. Each Kserver monitors its connection quality (latency, bandwidth, connectivity) to other Kservers with which it communicates and uses this information for controlling its protocol-related communication. Applications access Khazana files through a library called Kclient that encapsulates all communication with Kservers while exporting the Khazana session interface.
A file's permanent copy site is called its home node; there may be many cached copies (secondary replicas) at other nodes. A Kserver creates a local file copy when a client contacts it to access a file that it does not store locally.
Khazana files are named by unique numeric IDs (called KIDs) assigned to them by their home node at the time of their creation. To simplify file location and management of ID space for the purpose of our prototype, we employ a simple scheme wherein each Kserver manages its own local ID space and a file's KID is a combination of its home Kserver's network address and its ID within the Kserver's local ID space. A Kserver finds a file's permanent copy site based on the address hard-coded in the file's KID. We choose this scheme mainly for its simplicity and because naming and location management are not the main focus of our thesis. Khazana's design allows this scheme to be easily replaced by more sophisticated file naming and dynamic location tracking schemes such as those of Pastry [30], Chord [35] and Tapestry [40], without affecting the contribution of our thesis.
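The KID scheme above can be sketched as a simple bit-packing of the home Kserver's address and its locally assigned ID. The exact field layout below (IPv4 address, 16-bit port, 64-bit local ID) is an illustrative assumption, not Khazana's actual wire format:

```python
# Sketch of the flat KID naming scheme: a KID combines the home
# Kserver's network address with an ID from that server's local ID
# space, so any Kserver can locate the permanent copy site directly.
import ipaddress

def make_kid(home_ip: str, home_port: int, local_id: int) -> int:
    """Pack a home Kserver address and a local ID into one numeric KID."""
    addr = int(ipaddress.IPv4Address(home_ip))
    return (addr << 80) | (home_port << 64) | local_id

def kid_home(kid: int) -> tuple:
    """Recover the home node address hard-coded in a KID."""
    addr = ipaddress.IPv4Address(kid >> 80)
    port = (kid >> 64) & 0xFFFF
    return (str(addr), port)

kid = make_kid("155.99.212.1", 3040, 7)
assert kid_home(kid) == ("155.99.212.1", 3040)
assert kid & ((1 << 64) - 1) == 7   # local ID survives round-trip
```

Because the home address is embedded in the name, no global lookup service is required, at the cost of tying a file to its creation site; the thesis notes this can be replaced by Pastry/Chord/Tapestry-style location schemes.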
Even though a file can be always located by the home node address hard-coded in its file name, each Kserver maintains a cache of hints to quickly locate nearby Kservers containing copies of a file, before contacting its home node. This reduces access latency if the requesting Kserver is weakly connected to the file's home node relative to other copy sites. Our idea is analogous to cooperative proxy web caching [11], but is intended to work for write-shared data as well.
When a Kserver queries another Kserver for a file copy, the responder can either supply the requestor a copy itself, or supply a list of other replicas to query in turn (e.g., if it is overloaded). In the latter case, the querying Kserver sends its request to one of these sites based on its perceived connection quality to them (by contacting nodes with high connection quality first). As a side-effect, the supplier Kserver adds the requestor to its list of child copy sites and the requestor makes the supplier its parent. This replica creation mechanism dynamically forms a hierarchical network of replicas rooted at the file's home node, similar to the dynamic hierarchical caching introduced by Blaze [1]. This network tends to be organized such that nearby nodes are directly connected to each other. Alternatively, administrators could also specify more optimal topologies to be used. The fanout of any node in the hierarchy (i.e., number of child copy sites) is limited by the load-handling capacity of that node. When a link or node in the hierarchy goes down, the orphaned nodes try to attach themselves to the hierarchy again, starting at a known copy site (home node by default), effectively repairing the replica hierarchy. A Kserver is free to delete locally cached copies of files to reclaim local store space at any time, after propagating their updates to other copy sites.
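The "contact nodes with high connection quality first" step can be sketched as a simple ranking over the candidate list returned by an overloaded responder. The scoring function below (latency first, then bandwidth) is an assumption; the proposal only states that perceived connection quality drives the choice:

```python
# Minimal sketch of how a querying Kserver might order candidate
# replica sites before sending its copy request.
def rank_candidates(candidates):
    """candidates: list of (node, latency_ms, bandwidth_kbps, connected)."""
    reachable = [c for c in candidates if c[3]]          # skip unreachable sites
    # Prefer low latency, break ties by high bandwidth.
    return sorted(reachable, key=lambda c: (c[1], -c[2]))

sites = [("A", 120, 500, True), ("B", 15, 800, True), ("C", 5, 100, False)]
order = [node for node, *_ in rank_candidates(sites)]
assert order == ["B", "A"]   # C is disconnected and skipped
```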
Khazana exports the following operations to its clients via its library interface:
Khazana's consistency management (CM) subsystem is responsible for maintaining coherence among file copies created as described above. Khazana provides user choice over a set of consistency policies on a per-file basis, and over the various mechanisms to be used in implementing these policies. All the consistency schemes utilize a common set of underlying update propagation mechanisms that operate via the replica network. These mechanisms only affect the performance but not the guarantees provided by the consistency policy. The decision on when to allow read/write access and propagate updates thus depends on the choice of consistency policy and the implementation strategy. In contrast to our approach, other existing systems (such as WebFS [36] and Fluid replication [5]) bundle these mechanisms as part of the implementation of each individual policy, and do not provide application control over their selection.
For the validation of our thesis, we plan to support the following consistency policies on a per-file basis in our prototype, and evaluate their suitability for a variety of applications in our targeted application classes. None of these policies is new, nor is our design restricted to use only these policies. Alternate policies can be implemented using our consistency framework while still using its basic mechanisms. All replicas of a region enforce the same consistency policy set for the region in contrast to systems like Fluid replication [5] that enable different clients to see the same file at different consistency levels.
We provide the following ways for applications to guide consistency implementation at a given replica site. Some of these parameters are settable on a per-region basis and affect all replicas. These can be set using the set_attr() interface operation described in Section 4.1.3. Some of them can be set on a specific replica and affect only that replica. They can be set using the set_copy_attr() interface operation. For explanation of these parameters and their values, please refer to Section 3.
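The two-level control interface can be illustrated as follows. All parameter names here are hypothetical stand-ins for the actual parameters defined in Section 3; the sketch only shows the split between region-wide attributes and per-replica attributes:

```python
# Illustrative model of the two scopes of consistency control:
# set_attr() affects every replica of a region, while set_copy_attr()
# tunes only one replica site.
region_attrs = {}   # kid -> region-wide attributes
copy_attrs = {}     # (kid, replica) -> per-replica attributes

def set_attr(kid, **attrs):
    """Set region-wide options (e.g., consistency policy) for all replicas."""
    region_attrs.setdefault(kid, {}).update(attrs)

def set_copy_attr(kid, replica, **attrs):
    """Set options (e.g., propagation strategy) for one replica only."""
    copy_attrs.setdefault((kid, replica), {}).update(attrs)

set_attr(42, policy="append-only", masters="multi")
set_copy_attr(42, "replica-7", propagation="push", max_staleness_sec=30)
```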
To aid in update propagation and detection of update conflicts, each replica maintains the following persistent state (common to all consistency policies) similar to the versioning state maintained in Bayou [7]:
The consistency manager at a replica site gets activated just before opening a session to perform admission control and to bring the local copy up-to-date for access, and just before closing the session to propagate updates (based on its propagation strategy). A replica is said to ``commit'' its local updates when it successfully sends them to its parent. How often and under what conditions a replica propagates local updates to its neighboring replicas (parent and children) depends on the consistency policy and the control parameters set for the replica. However, all the policies we implement use common underlying mechanisms for update propagation, which we describe in the following Sections.
A replica R has uncommitted updates if its local_version is higher than its parent.version_sent. To commit local updates, R sends the updates to its parent along with its parent.version_recved and local_version. The parent compares incoming parent.version_recved with its own local_version. If local_version is higher, then this indicates an update conflict, as the parent already received a previous update on the copy that the child R has independently updated. To incorporate the remote update, the parent receives and applies the update locally, increments its own local_version and saves it in R.version_sent. It then stores R's local_version in its R.version_recved, and sends its own new local_version to R. R stores this number in its parent.version_recved, completing the update commitment from R to its parent.
A parent replica has outstanding local updates with respect to a child replica C when its local_version is higher than its C.version_sent. When the parent decides to propagate outstanding updates to C, it sends the updates along with its own local_version and C.version_sent. C compares the incoming C.version_sent with its own local_version. If the local_version is higher, it indicates an update conflict, as C has some uncommitted updates, whereas its parent has independently updated the region. To incorporate the incoming update, C applies it to its local copy, increments its own local_version, and saves it in parent.version_sent. It then stores the parent's local_version in its parent.version_recved and sends its own new local_version to the parent, which in turn stores this number in its C.version_recved, completing the update propagation from the parent to C.
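The per-neighbor versioning state and the child-to-parent commit path described above can be sketched as follows. This is a simplified model under stated assumptions: version counters start at zero, updates are carried in an operation log, and conflict handling itself is left to the policy-specific resolution of Section 4.2.3:

```python
# Sketch of Bayou-style per-neighbor versioning for update commitment.
# version_sent[n]  = our local_version at the last sync with neighbor n
# version_recved[n] = n's local_version at the last sync with n
class Replica:
    def __init__(self, name):
        self.name = name
        self.local_version = 0
        self.version_sent = {}
        self.version_recved = {}
        self.log = []            # applied updates (operation log)

    def apply(self, update):
        self.log.append(update)
        self.local_version += 1

    def has_uncommitted(self, parent_name):
        # Uncommitted iff we advanced past what we last sent the parent.
        return self.local_version > self.version_sent.get(parent_name, 0)

def commit_to_parent(child, parent, updates):
    """Child pushes its uncommitted updates to the parent.
    Returns True if a conflict was detected (parent advanced independently)."""
    last_seen = child.version_recved.get(parent.name, 0)
    conflict = parent.local_version > last_seen
    for u in updates:            # parent incorporates and owns the updates
        parent.apply(u)
    parent.version_sent[child.name] = parent.local_version
    parent.version_recved[child.name] = child.local_version
    # Parent replies with its new local_version; child records the sync point.
    child.version_recved[parent.name] = parent.local_version
    child.version_sent[parent.name] = child.local_version
    return conflict

p, c = Replica("P"), Replica("C")
c.apply("append-a")
assert c.has_uncommitted("P")
assert commit_to_parent(c, p, ["append-a"]) is False   # no conflict
assert not c.has_uncommitted("P")
```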
Once a child replica's updates have been accepted by its parent, the parent treats those updates as its own for the purpose of propagating them to other replicas. The child should not propagate those updates to any new parent that it contacts later (due to disconnection from its old parent and re-attachment to the replica hierarchy); otherwise, they would become duplicate updates.
When a replica R loses its connection to its parent and needs to re-attach itself to the hierarchy, it looks for a new accessible replica P using Khazana's region location mechanism described in Section 4.1.1. It then sets P as its parent, obtains a fresh copy from P (along with its version number), and re-applies its local uncommitted updates (if any) on this copy. If R were instead simply to propagate its local updates to P as described above, it might propagate updates that it had already received from its old parent, causing duplicate updates. There is no need to re-apply updates when operation logging is not being performed.
An update conflict arises when two replicas are independently modified, starting from the same version. Pessimistic locking-based consistency schemes prevent such conflicts by delaying conflicting operations, whereas optimistic schemes allow them, but employ versioning to detect and resolve conflicts. In our update propagation scheme, a conflict can be detected either while committing a child replica's updates at its parent, or when applying a parent's committed updates at a child replica site.
The action to take when a conflict is detected is consistency policy-specific. Conflict resolution can either require manual intervention, rely on resolution rules enforceable by the system itself (e.g., last-writer-wins, or using merge procedures), or need application-specific techniques. The rule-based approach is simple to implement, but may not successfully resolve all conflicts. In case of the last-writer-wins optimistic policy, the incoming updates are incorporated into the local copy. In case of the append-only consistency policy, the updates are applied such that the parent's updates appear first and the child's uncommitted updates appear next.
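The two rule-based resolutions named above can be sketched in a few lines, assuming the replica keeps its uncommitted operations in a log:

```python
# Sketch of the two system-enforceable conflict resolution rules.
def resolve_last_writer_wins(local_value, incoming_value):
    # The incoming committed update simply overwrites the local copy.
    return incoming_value

def resolve_append_only(committed_appends, uncommitted_appends):
    # Parent's committed appends are ordered before the child's
    # uncommitted ones, so all replicas converge on the same order.
    return committed_appends + uncommitted_appends

assert resolve_last_writer_wins("old", "new") == "new"
assert resolve_append_only(["p1", "p2"], ["c1"]) == ["p1", "p2", "c1"]
```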
The frequency of propagation of updates from a replica is partly determined by the consistency policy and partly by the control parameters set for the replica. In case of pull-based update transfer, updates at a replica are sent to others only upon explicit request. In pessimistic consistency schemes, this update request is issued before allowing read access to a replica, whereas in optimistic schemes, this is performed periodically at a frequency settable on a per-replica basis via the control parameters described in Section 4.2.2. When a client registers for asynchronous notification of updates to a replica, the replica site in turn sends a registration request to its parent to push updates to it based on a transfer condition set on the replica in a previous set_copy_attr() call. The parent can in turn forward this to its other neighbors and so on. Later, when an update arrives at a replica site, it propagates the update to all registered neighboring replicas (i.e., parent and children) whose update transfer condition is satisfied.
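A replica's decision to propagate can be modeled as evaluating its transfer condition against pending state. The condition fields below are assumptions standing in for the set_copy_attr() parameters of Section 4.2.2, combining the pull, eager, timer-driven, and staleness-driven triggers described above:

```python
# Illustrative per-replica update transfer condition check.
def should_propagate(cond, pending_updates, secs_since_last_push, pull_requested):
    if pull_requested:                                   # pull-based: explicit request
        return True
    if cond.get("eager"):                                # push on update arrival
        return pending_updates > 0
    if secs_since_last_push >= cond.get("period_sec", float("inf")):
        return pending_updates > 0                       # timer-driven
    if pending_updates >= cond.get("max_unsent", float("inf")):
        return True                                      # staleness-driven
    return False

assert should_propagate({"period_sec": 60}, 3, 90, False)       # timer expired
assert not should_propagate({"period_sec": 60}, 3, 10, False)   # too soon
assert should_propagate({"max_unsent": 2}, 2, 0, False)         # too stale
```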
We implement the pessimistic strict consistency policy in Khazana as follows. At each replica site, we maintain an access token that indicates what operations (among read, update and append) are allowed on that replica. Strict consistency requires that only one replica hold the update access token at any time. In a single-master implementation, one replica site (the home node in our prototype) always holds the update token and the other replicas forward all update requests to it. In a multi-master implementation, this write token gets circulated among the replicas based on write requests arriving at their sites. Multiple replicas can hold read tokens, but they are all revoked (via the replica hierarchy) before allowing write access. When a replica site transfers its write token to another replica, it also triggers propagation of its updates to the receiver of the token, via the replica hierarchy.
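The access-token rule can be sketched as a toy manager that enforces "many readers or one writer." Token revocation through the replica hierarchy, and the update push that accompanies a token transfer, are abstracted into direct set operations here:

```python
# Toy sketch of the strict-consistency access-token invariant.
class TokenManager:
    def __init__(self):
        self.readers = set()   # sites holding read tokens
        self.writer = None     # at most one site holds the write token

    def grant_read(self, site):
        # Simplification: deny reads while another site holds the write
        # token (a real implementation would revoke the write token first).
        if self.writer is None or self.writer == site:
            self.readers.add(site)
            return True
        return False

    def grant_write(self, site):
        # All read tokens (and any previous writer) are revoked first;
        # the token transfer would also trigger an update push to `site`.
        self.readers.clear()
        self.writer = site
        return True

tm = TokenManager()
assert tm.grant_read("A") and tm.grant_read("B")
tm.grant_write("C")
assert tm.readers == set() and tm.writer == "C"
assert not tm.grant_read("A")   # reads blocked while C holds the write token
```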
We implement the optimistic policies (last-writer-wins and append-only) in Khazana as follows. All read operations are allowed to proceed immediately on all replicas. Update requests are forwarded to the home node in the single-master approach. When an update is made at a replica, it is propagated to others based on the update transfer condition set at various replica sites. Update conflicts are resolved as described in Section 4.2.3. Append operations are logged explicitly, as they need to be re-applied on a replica during conflict resolution.
In this Chapter, we described a prototype wide-area data store called Khazana that we are building to evaluate our thesis. Our design splits up consistency management into a collection of core mechanisms that serve as building blocks for composing a variety of consistency schemes to suit specific application needs. Our hierarchical organization of replicas and multi-level update propagation and conflict resolution using the hierarchy limit the amount of state that needs to be maintained at a replica site, making our implementation scalable to a large number of replicas per object.
The success of our thesis lies in demonstrating that flexible consistency management over the wide-area has the following properties:
In our prototype, we propose to evaluate our system's scalability to about 512 clients served by up to 128 servers (this limits the number of replicas per object to 128), with about 1000 objects (files) per server, over a simulated WAN spanning up to 10 LANs connected by variable speed links into various topologies. We choose these limits considering the resources available in the Utah network test-bed [8] which we plan to use for our experiments. We also plan to conduct a wide-area experiment with up to 5 servers geographically distributed across the U.S., to evaluate our system's performance in real wide-area conditions.
We will measure the access rate sustained by our system at various scales.
In the following Sections, we describe four representative Khazana-based applications and the set of experiments we propose to conduct to validate the above claims. The data coherence needs of these applications vary widely. We will show that the various combinations of policies and implementation techniques that we provide are sufficient to support these needs well. We will also demonstrate how some of the consistency policies are useful for more than one application. These together validate our reusability claim 2(a).
DataStations is a distributed file service that uses Khazana's caching and consistency management to provide coherent read/write file sharing across the wide-area. This application belongs to the data-sharing class. For our thesis, we focus on supporting files with five different access patterns: type 1 - infrequently updated by a few clients, and infrequently read by many (e.g., system files); type 2 - infrequently updated by a few clients, but frequently read by many (e.g., configuration files); type 3 - frequently read/written by a single mobile user, but rarely shared among multiple users (e.g., personal files); type 4 - frequently updated among a few users, but read by many (shared project repository files such as those in a CVS repository); type 5 - widely read and updated append-only files (such as log files). An application can individually set different consistency policies and control parameters for a Khazana file to suit its behavior at any time. Khazana does not infer these parameters by observing the file's usage patterns.
With this experiment, we propose to demonstrate how our flexible multi-policy consistency management system can support access to a large number of files with the variety of access patterns described above, and evaluate its scalability. We employ single-master optimistic consistency with pull-based updates for type 1 files and push-based updates for type 2 files. We employ a multi-master pull-based pessimistic consistency with token-based read/write access for type 3 files. We employ multi-master pull-based pessimistic consistency with lock-based writes for type 4 files, and multi-master pull-based optimistic append-only consistency for files of type 5. The consistency policy and control parameters for a Khazana file can be set and/or changed any time by using its set_attr() and set_copy_attr() interface operations.
Our experimental setup will consist of 128 servers spread across 10 LANs. We will create many files per server uniformly belonging to the 5 types listed above. Clients spread uniformly across these LANs perform random read and write operations on these files, adhering to the usage patterns described above for each type of file. We will measure the average access latency for read and write operations for each type of file and the average throughput (operations per second) perceived by clients for various numbers of clients from 2 to 512 in increasing powers of 2. We expect the variation of access latency with load for each type of file to be reasonable and not to be drastically affected by simultaneous activity on other types of files. This evaluates the overall scalability of our system with number of clients.
We will repeat this experiment with various numbers of files per server from 8 to 1024 in powers of 2 and measure the access latencies as above. This evaluates the scalability of our system with the number of files.
The techniques needed to efficiently maintain consistency vary with the file's usage patterns. For instance, since files of type 1 are typically written by a few clients (e.g., administrators), a single-master optimistic approach to consistency works well. Also, since they are infrequently read by many users, a pull-based mechanism for propagating updates to readers suffices. On the other hand, type 2 files differ slightly from type 1 files in that they are frequently (continually) read by many users (e.g., consider configuring thousands of workstations in corporate Intranets). A pull-based mechanism in their case overwhelms the single master with pull requests from many readers much of the time. Hence for type 2 files, a master-initiated push through a hierarchical replica topology performs and scales better than the pull-based approach.
We propose to demonstrate this as follows. We create a file of type 1 on a Kserver and perform updates to it from a few clients randomly (once every 5-10 sec), while randomly reading it (once every 30-60 seconds) from a set of clients on other LANs. We measure the average access latency and overall bandwidth utilization for read and write operations when employing push vs. pull-based update propagation using single-master optimistic consistency. We do this while increasing clients in powers of 2, from 8 to 512. We repeat the experiment with the type 2 file being read by each client once every second. We expect the bandwidth utilization to be lower for the pull-based approach in case of type 1 file and for the push-based approach in case of type 2 file. We expect the gap in utilization for the two approaches to increase with the number of clients. This experiment validates our claim 2(c).
Different types of files have different combinations of requirements for consistency (C) and availability (A) in the presence of intermittent connectivity and partitions (P). In fact, the CAP principle (stated by Fox et al. [10]) says that at most two out of these three properties can be achieved for a given piece of data. The level of availability that can be achieved in intermittent connectivity environments is limited by the strength of the consistency guarantee needed by application data. To support a variety of combinations of C and A requirements, the system needs to provide consistency policies with a variety of consistency and availability combinations. Many choices have been explored by prior research. For the purpose of our thesis, we describe three useful C-A choices with distinct behavior in the presence of network partitions and show that they can be supported as part of our consistency management system. They are: option 1 - allow unrestricted read/write access to data within a partition, and resolve update conflicts either with a simple rule, or refer them to the user; option 2 - allow read-only access in all partitions, but read-write access in at most one partition; and option 3 - disallow access in all but one partition. Option 1 is suitable for files of type 3 and 5 and can be achieved with multi-master optimistic consistency. Option 2 can be enforced using a slight variation on multi-master pessimistic lock-based consistency that allows write access in the partition containing the write privilege, and is suitable for files of type 4. Option 3 is provided by the strict consistency policy with read and write locks, and is suitable when either stale data is useless (e.g., game state in a distributed multi-player game), or it is important to ascertain the internal integrity of data before allowing access (such as in databases, and data structures like hash tables and B-trees). It is suitable for our directory application described in the next section.
We demonstrate the variety of options for disconnected operation by repeating experiment 1 with a fixed number of files and clients, but forcing inter-LAN link failures and node failures. We expect to demonstrate that option 1 files can still be accessed in unrestricted manner, option 2 files can only be read, while option 3 files cannot be accessed at all. We show this by indicating a very large access latency for inaccessible operations in our graphs. We show how our system behaves when the links and nodes come back up.
In its essence, a directory stores mappings between (fixed-length) names and values and exports four operations: add, delete, modify, lookup. This service falls under the data-sharing class and can be used as a building block for a variety of useful applications such as user profile databases and text search indexes. We will implement a prototype directory as a fixed-size page-based hash table inside a Khazana file. The directory is widely shared and all operations are frequent. Users access the directory through a library that manipulates its cached copies of the shared file's pages using the Khazana interface. Consistency among the cached copies of the directory's pages can be maintained using Khazana's pessimistic strict consistency policy or an optimistic last-writer-wins policy with operation logging. Performance varies depending on how widely the directory is shared and updated. The latter policy is complex to implement, but provides better concurrency and allows large-scale sharing. For our prototype directory implementation, we will employ a multi-master strict pessimistic consistency scheme with pull-based update propagation among replicas.
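The page-based hash table layout can be sketched by mapping each fixed-length name to a page (and a record slot within it) inside the directory file, so each operation touches only the pages it needs, and Khazana's page-granularity consistency keeps exactly those pages coherent. The page size, page count, and record layout below are illustrative assumptions:

```python
# Sketch of the directory as a fixed-size, page-based hash table
# stored inside a single Khazana file.
import hashlib

PAGE_SIZE = 4096
NUM_PAGES = 256
REC_SIZE = 64                       # fixed-length name + value record
SLOTS_PER_PAGE = PAGE_SIZE // REC_SIZE

def page_of(name: bytes) -> int:
    """Hash a fixed-length name to the page holding its bucket."""
    h = int.from_bytes(hashlib.sha1(name).digest()[:4], "big")
    return h % NUM_PAGES

def file_offset(name: bytes, slot: int) -> int:
    """Byte offset of a record slot, i.e., where a page-granularity
    read/write through the Khazana session interface would land."""
    return page_of(name) * PAGE_SIZE + slot * REC_SIZE

p = page_of(b"alice")
assert 0 <= p < NUM_PAGES
assert file_offset(b"alice", 0) == p * PAGE_SIZE
```

An add/lookup then reads only the one page its key hashes to, which is why clients with locality end up caching just their working set of directory pages.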
With this application, we propose to demonstrate how our consistency management system can be used to support widespread cached access to a page-based data structure in a way that provides strict coherence guarantees. To show this, we create a single directory and perform various operations on it from multiple clients in different LANs, each of which operates on the directory through a nearby Kserver (at most one per LAN). Kservers cache the directory's pages as a side-effect of access. Each client performs various mixes of 1000 operations at the rate of 4 or 5 operations per minute. The first mix consists of a relatively large number of additions, to populate the directory. The second mix consists of many lookups spanning many LANs, with relatively few updates. The third mix contains an equal proportion of all operations spanning many LANs, but with clients exhibiting significant locality within a LAN. This last mix illustrates a typical usage scenario for a directory in an Intranet spanning multiple campuses. For each mix, we measure the average response time and bandwidth utilization per operation, while increasing the number of clients from 2 to 128 and creating replicas as a side effect. We expect the directory's performance to scale reasonably well for the second and third cases, but not for the first one, which is inherent in the consistency policy. Scalability in the third case is due to the aggressive caching of working sets of directory pages in various LANs avoiding WAN communication.
We will repeat the experiment with an existing directory implementation such as Microsoft's Active Directory [24], configured with the same number of directory replicas as the Khazana-based version, and compare their performance. Existing directories employ static replication, but not dynamic caching. Although we cannot expect our untuned prototype to perform as well as the existing highly tuned directory in absolute terms, we expect the performance of our version to be within a reasonable factor of that of the existing version. However, as clients exhibit significant locality, the aggressive caching employed by our implementation helps improve the directory's response time with increasing load. Also, implementing our directory requires much less effort by virtue of being Khazana-based.
In addition to traditional read/write data access, the flexibility of our consistency management system allows it to efficiently support distributed data/event logging and many-to-many data exchange as well, across the wide-area. To demonstrate this, we implement an Internet chat room as concurrent appends to a single transcript file in Khazana employing its append-only consistency scheme. The file containing the chat transcript can be replicated sufficiently to provide low-latency nearby access to its widely dispersed chat clients. We employ multi-master append-only optimistic consistency with push-based update propagation along the replica topology to achieve efficient multicast of appends. Clients register at a replica site to notify it of remote updates asynchronously.
This service tests the latency and efficiency of real-time data propagation in Khazana. We evaluate the responsiveness and scalability of the chat room by starting clients on multiple LANs. Each client registers for asynchronous updates to its copy and writes to the chat room randomly once every 30 seconds. We measure the performance of the chat room in terms of the time taken for an append to be accepted by the system, the minimum and maximum time taken for an append to reach another chat client, and the bandwidth utilized for the chat session. We measure this while gradually adding clients (increasing from 2 to 512) on more and more LANs (up to 10).
We repeat this experiment with a single-master, non-hierarchical caching and consistency scheme and measure the above parameters. This approach is similar to that of existing chat implementations, but uses more bandwidth. We also repeat the experiment with a popular existing single-server chat implementation such as IRC [4].
We expect to show that propagation time is reasonable when compared to a single-master implementation as well as IRC, but the bandwidth utilization is much better. Propagation time does not change much as more clients are added to a LAN, but gracefully degrades as clients spread across LANs. Also, we expect to show that a network partition affects only those inside the partition; others can continue to chat.
We also propose to measure the time taken for a new chat client to join the session and obtain the entire transcript both in the single-master and multi-master implementations, and show that the latter takes less time.
Data dissemination differs from real-time data exchange in that data is typically updated by only one entity, whereas it is potentially monitored by a large number of clients, typically with varying consistency requirements and tolerance to staleness of data. Even though IP multicasting [6] performs better for this, achieving data multicast as a side-effect of update propagation among replicas has several benefits. It allows intermediate data staging along the multicast path to cache data closer to its clients, reducing access latency and the sender's load. It enables the quality of multicast data and the frequency of data updates to be customized at intermediate hops to the consistency level needed by individual replicas, unlike IP multicasting, where data quality is controlled only at the sender. Client control over data quality level is desirable in applications such as multimedia streaming because there is a cost (e.g., bandwidth utilization) associated with maintaining a given data quality level and it depends on the bandwidth available to the client.
We demonstrate the value of adapting update propagation to varying consistency levels on a per-replica basis by implementing a scoreboard. The scoreboard application broadcasts the status of an ongoing game, updated sporadically at one site, to registered listeners across a WAN. Each listener can set how often her copy is refreshed (based either on time or on the maximum number of unnotified scoreboard updates tolerated), and can change this refresh rate dynamically as the game progresses at the remote site. A user can also explicitly request a refresh. The scoreboard state is kept in a Khazana file whose replica consistency policy is set to single-master, push-based, last-writer-wins optimistic consistency. The consistency management system can batch updates before propagating them, tuning the amount of update traffic generated based on network conditions and the staleness tolerated by listeners. This application illustrates one-to-many non-real-time data dissemination using Khazana's consistency management system. In addition to asynchronous updates, it requires the ability to tune the update propagation rate at a replica site, as well as to explicitly request update propagation.
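The per-listener refresh policy above can be sketched as follows. This is a minimal simulation under assumed names (`ScoreboardMaster`, `ListenerPolicy` are hypothetical, not Khazana's real API): an update is pushed to a listener when either its refresh interval elapses or its tolerated number of unnotified updates is reached, and a listener can explicitly request a refresh at any time.

```python
import time

class ListenerPolicy:
    """Per-listener staleness bound: push after `interval` seconds have
    elapsed or after `max_unnotified` pending updates, whichever comes
    first. Names are illustrative, not Khazana's actual API."""
    def __init__(self, interval=None, max_unnotified=None):
        self.interval = interval
        self.max_unnotified = max_unnotified

class ScoreboardMaster:
    """Single-master, push-based propagation with per-listener batching."""
    def __init__(self):
        self.listeners = []   # (policy, per-listener state) pairs
        self.version = 0      # monotonically increasing scoreboard version

    def register(self, policy, now=None):
        now = time.monotonic() if now is None else now
        state = {"pending": 0, "last_push": now, "delivered": 0}
        self.listeners.append((policy, state))
        return state

    def update(self, now=None):
        """Apply one scoreboard update; push only to listeners whose bound is hit."""
        now = time.monotonic() if now is None else now
        self.version += 1
        for policy, st in self.listeners:
            st["pending"] += 1
            count_due = (policy.max_unnotified is not None
                         and st["pending"] >= policy.max_unnotified)
            time_due = (policy.interval is not None
                        and now - st["last_push"] >= policy.interval)
            if count_due or time_due:
                self._push(st, now)

    def refresh(self, st, now=None):
        """Explicit, listener-requested refresh."""
        self._push(st, time.monotonic() if now is None else now)

    def _push(self, st, now):
        st["delivered"] = self.version   # batched updates collapse into one push
        st["pending"] = 0
        st["last_push"] = now
```

Changing a listener's refresh rate dynamically amounts to mutating its `ListenerPolicy` fields; the next `update` call applies the new bound.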
We evaluate how Khazana's update propagation can be adapted to provide a variety of data quality levels at individual replica sites. For this, we create a scoreboard and update it randomly once every 10-15 seconds. We run clients on multiple LANs, each registering for updates. Some register for updates every D seconds, where D is chosen randomly between 15 and 120 seconds; others choose a maximum number of unnotified updates randomly between 1 and 5. We measure the overall bandwidth utilization for propagating 1000 updates and contrast it with the case where updates are immediately propagated to everybody. We also measure the time taken by a new client to register and receive the first update, with and without intermediate caching. We repeat this experiment varying the number of clients from 2 to 512 in powers of 2.
We expect bandwidth utilization in the customized propagation case to be lower than in the eager propagation case, and the gap to widen for larger client sets. We also expect the time taken to receive the first update not to degrade with an increasing number of clients when intermediate caching is used, and to degrade in its absence. This experiment validates our claim 2(d).
In this chapter, we described the success criteria for our thesis by stating a set of claims that together validate our thesis statement. We proposed to implement a set of useful wide-area applications selected from distinct application classes. We then described a series of experiments and their expected results that serve to validate each of our claims. These experiments also demonstrate the performance and scalability of applications using our consistency management system, thereby establishing its feasibility and usefulness.
We currently have a working implementation of our prototype, Khazana, that runs on a LAN and provides an API for allocating, freeing, and session-based read/write access to Khazana regions. Khazana nodes (currently limited to 64) cooperate to locate regions and create a hierarchy of cached replicas of a region as a side effect of access. A single-master, pull-based, last-writer-wins optimistic consistency algorithm is working in Khazana. Using it, we have implemented a prototype file system called KidFS that uses one Khazana region per file. It provides a traditional hierarchical naming layer, with directories implemented as fixed-size closed hash tables stored inside Khazana regions.
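A fixed-size closed hash table of the kind KidFS uses for directories can be sketched as follows. This is a hypothetical illustration of the data structure, not KidFS's actual on-region layout: the slot count, entry format, and linear-probing scheme are all assumptions.

```python
# Sketch of a KidFS-style directory: a fixed-capacity closed
# (open-addressing) hash table of name -> region-id entries that
# fits inside a single region. All layout details are assumed.
NSLOTS = 128
EMPTY = None

def make_dir():
    return [EMPTY] * NSLOTS            # fixed capacity; never grows

def dir_insert(table, name, region_id):
    h = hash(name) % NSLOTS
    for i in range(NSLOTS):            # linear probing within the table
        slot = (h + i) % NSLOTS
        if table[slot] is EMPTY or table[slot][0] == name:
            table[slot] = (name, region_id)
            return True
    return False                       # directory full: closed hashing cannot grow

def dir_lookup(table, name):
    h = hash(name) % NSLOTS
    for i in range(NSLOTS):
        slot = (h + i) % NSLOTS
        if table[slot] is EMPTY:
            return None                # empty slot ends the probe: name absent
        if table[slot][0] == name:
            return table[slot][1]
    return None
```

The fixed size is what lets the whole directory live in one region and be cached and kept consistent as a unit, at the cost of a hard cap on entries per directory.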
In this thesis, we proposed to show the feasibility and value of implementing a consistency management system flexible enough to serve the data access and delivery needs of a variety of wide-area applications, in the context of a wide-area data storage and caching infrastructure called Khazana. We target applications in three classes, namely those involving 1) read/write access to shared data/state, 2) data dissemination, and 3) data exchange among distributed components over the wide-area. Our approach to achieving scalability, flexibility, and performance is to provide a set of useful consistency guarantees for our targeted application classes while exposing enough hooks to allow applications to guide the system's consistency management mechanisms. Unlike other systems that provide either scalable wide-area support or flexible consistency, our approach provides both in a single system, with application control over its mechanisms.
We proposed a plan to evaluate the effectiveness of our approach and its scalability in the number of objects and clients. Our approach provides several benefits over existing systems. Specifically, it enables reuse of consistency mechanisms across a variety of applications, with different consistency schemes coexisting in the same system; it enables customization of the implementation for performance; it helps achieve scalable data delivery across the wide-area; and it conserves bandwidth by enabling applications to tailor data quality during update propagation to replica-specific needs.
The translation was initiated by Sai Rama Krishna Susarla on 2001-10-16