Building a Lower Cost Generic Object Store

We plan to provide a generic storage service that offers attractive "enterprise" qualities, but at lower cost. We plan to address the issues sketched above by having multiple copies of each object, which will get propagated to multiple stores, some geographically distributed. As demonstrated, it is possible to make large cheap (and cheap-to-maintain) stores this way. Having multiple copies addresses our additional needs in several ways:

Durability: Rather than tape backup, multiple disk copies provide durability after the loss of a single copy. To guard against physical disasters, copies will be distributed to physically disparate sites, for example, to the different campuses involved in CITRIS. Given the cost of tape backup, and the low cost of commodity storage, it should be possible to provide enough copies of an object to provide at least the equivalent durability afforded by tape at a lower cost.

Availability: Multiple copies on separate servers provides a some level of availability in the case of disk failure.

Support for Multiple Platforms: Copies can be kept on different platforms as well. For example, those users needed both NTFS and NFS can arrange to have copies propagated to stores separately supporting each file system type.

There are any number of technical choices involved in the implementation. For example, rather than making simple copies, durability can be enhanced using erasure codes, which provides much higher probability of recoverability for the same amount of redundant storage. Similarly, copies need to be synchronized, for which any number of strategies exists, ranging from use of rdist and gcp, etc., to more experimental approaches. Versions need to be identified, for which many schemes exist. Availability can be enhanced by using IDE RAID, which seems to offer good performance at about twice the cost of un-RAIDED disks (Chung, Gray, Horst and Worthington).

There is also the question of how technologically aggressive to be. At the most aggressive end, Oceanstore provides an end-to-end transparent approach to most of our concerns. However, whether we can actually deliver such a service to users in a practical fashion is an open question. Also, moving from more reliable somewhat higher-end servers to lower cost generic servers may impose an availability penalty we are unwilling to pay. Therefore, we will not commit to particular mechanisms here, but note that there are any number of schemes that will likely prove suitable for our purposes, and may attempt to introduce increasingly technologically aggressive ones over time. Below we describe some of our research that we hope will inform and support the infrastructure components we seek to develop here.

Note that, while each of these projects is limited by storage needs, none of them is a storage infrastructure project, so no individual project can afford the effort needed to set up and maintain a disk-to-disk backup system, etc. Just as Millennium was an efficient way of providing low-cost cluster computing to a large number of projects, it is only feasible to try to deliver low-cost storage if we aggregate enough needs to approach the problem at a large enough scale.

Thus, we propose to construct the storage elements needed to support this evolving view of universal, high-capacity, low-overhead storage within the context of UC Berkeley's CITRIS environment. CITRIS is a useful environment because it generates a wide variety of storage needs for a broach range of research projects, some of which are CISE-related. Thus, this proposed facility will both let us address specific needs of a large group of CISE researchers that cannot otherwise we addressed by individual research grants, but will also provide a testbed for creating the storage infrastructure of the future.

Smart distributed storage nodes

As suggested above, it is now feasible to provide as a networked service significant storage based on inexpensive commodity parts. Current (and improving) disk-drive technology makes terabyte mini-servers feasible for ~$4,000. One can distribute a number of such servers at appropriate points in the network, and configure them with a range of increasing technologically aggressive properties. At the most conservative end, these systems run as more or less independent, if cooperating, entities. At the more aggressive end, the nodes act as a coherent storage system, such as that of Oceanstore. We plan to begin with a more conservative architecture, and attempt to roll out more technologically aggressive services as the technology matures.

Fundamental to using such nodes is the model of decentralized, peer-to-peer storage that has recently emerged as viable for scalable, highly-available, and highly-durable storage. Peer -to-peer storage systems assume that physical components (e.g. servers, disks, and network connections) are both unreliable (i.e., can fail at any time) and untrusted (i.e., potentially compromised and under control of adversarial forces), and rely instead on the behavior of multiple, redundant elements working together. Peer-to-peer architects exploit this redundancy, along with a variety of cryptographic techniques, to achieve guarantees that are normally expected of centrally managed storage infrastructures. Indeed, the process of assuming that physical components are fallible has been an extremely productive one for system architects. The resulting systems promise to achieve globally-available storage that is far more stable, durable, secure, and easy to manage than any existing solutions.

A peer-to-peer approach uses a large number of distributed storage nodes, and allows attractive aggregate properties to emerge from these via using a number of techniques, three of which we plan to exploit here: location-independent routing, versioning, and Byzantine Agreement.

Location-Independent Routing: One of the most important enabling technologies for distributed storage is a data location layer. Such a layer may be viewed as simply a directory service, able to track the locations of every replica of every piece of information. The presence of such a layer greatly increases availability-, as when one server crashes, the location layer is able to direct queries for data to backup servers. The data location layer can itself be distributed across many nodes in the system for fault tolerance.

Distributed systems capable of locating information are sometimes described as Decentralized Object Location and Routing systems (DOLRs). DOLR systems provide a network abstraction of routing directly to objects by name, thus supporting fault-tolerant access of information independent of its location. Indeed, several of these systems, including Tapestry, developed at Berkeley, continue to operate correctly even as components fail or come under attack. Moreover, data can be moved, repaired, and copied without explicitly informing the consumers of the data, thus greatly enhancing the potential for automatic management of information.

Versioning: It is plausible to consider keeping every version of every object indefinitely on disk. We call this data model "versioning". With versioning, each update of a data object generates a new version.

Versioning has several consequences. First, each version of an object now becomes read-only and can be uniquely identified via cryptographic hashes, such as SHA-1. With a sufficiently large hash value (SHA-1 produces 160 bits), collisions will never occur, and as a result, each version can be uniquely identified forever by its hash value, which can be used as the version's Globally Unique IDentifier (GUID).

Second, the read-only nature of each version greatly simplifies archiving. Read-only data can be stored in efficient formats such as via erasure codes and can be actively repaired, i.e., automatic processes can scan through the data on a regular basis and filter out bad copies by checking each recovered piece of information against its GUID. The result is fundamentally more powerful than traditional tertiary storage infrastructures utilizing archival media such as tape. Since information is continuously repaired and moved from physical media to physical media, it becomes independent of media degradation and obsolescence.

Third, newer versions of an object can contain references to previous versions by including GUIDs of these previous versions. This fact permits efficient support of versioning via "copy on write". It also permits the ability to "time travel" or view the state of a data object at an arbitrary point in the past. Time travel obviates the need for traditional backup infrastructures and permits easy recovery from faults and human error. As an additional consequence of versioning, cached copies can be kept coherent by distributing the GUID of the latest version to each cached.

Byzantine Agreement: Byzantine Agreement allows a set of peers to come to a unified decision about something, even if some of them (less than 1/3) are actively attempting to compromise the process. Should the correct number of nodes agree, the result can be signed in aggregate with threshold signatures to permit others to verify the result of the decision at a later date. Byzantine agreement is an ideal technology to embed in the center of an archival system to provide write access control and guaranteed levels of replication.

The above three technologies are beginning to appear in a number of storage infrastructures including OceanStore (Berkeley) and Farsite (Microsoft). When combined with fast caching of live data (see below), these three technologies can yield fast, secure, and durable storage. One of these techniques is that a peer-to-peer storage infrastructure should consist of many storage servers with large amounts of storage, sufficient computational power to perform cryptographic operations at network speeds, and sufficient network bandwidth to serve as an active DOLR router. This is what we call a smart storage node.

Distributed Storage: The Initial Step

As a first step to a distributed storage infrastructure, we plan to distribute a series of independent storage nodes to geographically distinct sites around the Berkeley campus and at other CITRIS-affiliated (particularly UC Davis). We will build a simple indirection directory service as a small number of "gateway" machines that are responsible for tracking the locations of replicas of data. Such a degenerate version of a DOLR is easily implemented in the form of LDAP servers. These servers will monitor the state of the storage servers and the data stored on these servers (at a very course grain, such as directories or file systems). When a storage server crashes, these directory servers will be capable of suggesting which server is available for reading the requested copy. We will make the LDAP servers highly-available through clustering or other established technology.

In this manner, we can provide archival disk storage by building a very simple storage layer that will indirect through the directory servers to access information on the distributed storage servers. The storage servers themselves will utilize commodity storage technology in the form of file servers exporting NFS or NTFS. Basic versioning in this system will consist of nightly incremental backups to the distributed storage servers. In this incarnation, we will cache complete copies of files on storage servers and only push updates to files that have changed.

Archival storage will be regarded as read-only, with every version uniquely named with a hash value (GUID). Thus, a storage server will know if it has the requested GUID, and will never have to worry about having an out-of-date version. A background process will seek to migrate new versions to servers that unavailable when versions are written.

The mapping between file names and GUIDs will be secured on highly-available storage (such as, for instance, the LDAP servers). There are a number of relevant techniques that may be utilized here. For example, one archival storage system we have constructed generates read-only archival versions of hierarchies of directories. In this case, the only mapping required to discover the latest copy of a directory hierarchy was a mapping from the top-level directory to the GUID of its latest version.

Distributed Storage: A More Advanced Model

A more ambitious goal is to construct a much more distributed system. We do not think it is realistic to provide such a production system during the lifetime of this grant, but plan to work toward it incrementally as an ultimate goal.

We plan to design such a system by building a "wide area" Tapestry starting with the Tapestry DOLR and utilizing the >a href="http://www.planet-lab.org/">PlanetLab infrastructure. PlanetLab is a global overlay network, currently spanning 100 sites and eventually growing to thousands, and exhibiting a rich diversity of link technologies, including some sites located at major routing and co-location centers of the Internet. While most of these sites have modest resources, several will exhibit substantial processing and storage resources-including the site supported by this experimental infrastructure proposal if funded. Preliminary versions of Tapestry already run on PlanetLab.

On top of Tapestry, we will place a unified storage system that supports incremental versioning and efficient archival storage through erasure coding. At this stage, we will place a larger number of storage servers throughout the network, thereby supporting the independent storage of archival data fragments, in which only parts of files that have changed are actually pushed out to the archive. Techniques can be used to recognize which sub-blocks of files have changed.

At this stage of development, client-side software to support our archival storage interfaces will be desirable. Once such interfaces exist (for instance, User-level file system support through custom NFS daemons), we can envision a smooth integration between the archival components and the fast data cache (described in below). Here information that was in active use could be transparently cached while inactive data could reside only in remote storage servers. Also at this stage, we can begin to experiment with active repair techniques that monitor storage servers and, based on erasure-coded storage, reconstruct information that has been partially lost on crashed servers.

Distributed Storage: Cross-Domain Cooperative Storage

At this point, it becomes possible to consider adding some architectural elements from the SAHARA "storage wide-area networking" project. This project aims to be able to create end-to-end network services with desirable and predictable properties, like performance and availability, when provisioned from multiple independent service providers. To do so, SAHARA is developing an architecture for future network services that supports the dynamic confederation of sometimes collaborating and sometimes competing service providers. Such an architecture would be helpful in the event that individual storage servers are owned by more than one organization, but combined as a whole to provide stable archival storage. It may be possible to experiment with such organizations first within CITRIS, and then, more broadly, through PlanetLab. As a basic first step, all information would be encrypted for privacy and authenticated through cryptographic techniques such as signatures and secure hashing. More interestingly, SAHARA and Tapestry could interact in productive ways, for example, by SAHARA providing Tapestry with suggestions for routing based in its understanding of networking properties. We will incorporate pieces of this vision as they stabilize and as time permits.

Also to be explored at this phase would be techniques for committing information in the wide area through Byzantine agreement protocols. In particular, we would explore the advantages of a hierarchy of trust: Some servers would be more trusted to perform commitment of information to the archive; however, even such trusted servers would be subject to compromise, thereby dictating the need for Byzantine agreement protocols. (Prototype Byzantine agreement protocols are already functioning within the OceanStore prototype.)

These steps, if successful, provide for the possibility of constructing a "survivable wide-area storage network", in which a disaster at some number of otherwise unrelated sites will be recoverable from the still functioning sites. In particular, there is a disaster recovery application of the SAHARA project, which is described further in the applications section below.