Building a Lower Cost Generic Object Store
We plan to provide a generic storage service that offers attractive
"enterprise" qualities, but at lower cost. We plan to address the
issues sketched above by having multiple copies of each object, which
will get propagated to multiple stores, some geographically distributed.
As demonstrated, it is possible to make large cheap (and
cheap-to-maintain) stores this way. Having multiple copies addresses
our additional needs in several ways:
Durability: Rather than tape backup, multiple disk copies provide
durability after the loss of a single copy. To guard against physical
disasters, copies will be distributed to physically disparate sites, for
example, to the different campuses involved in CITRIS. Given the cost
of tape backup, and the low cost of commodity storage, it should be
possible to provide enough copies of an object to provide at least the
equivalent durability afforded by tape at a lower cost.
Availability: Multiple copies on separate servers provides a some level
of availability in the case of disk failure.
Support for Multiple Platforms: Copies can be kept on different
platforms as well. For example, those users needed both NTFS and NFS
can arrange to have copies propagated to stores separately supporting
each file system type.
There are any number of technical choices involved in the
implementation. For example, rather than making simple copies,
durability can be enhanced using erasure
codes, which provides much higher probability of recoverability for
the same amount of redundant storage. Similarly, copies need to be
synchronized, for which any number of strategies exists, ranging from
use of rdist and gcp, etc., to more
experimental approaches. Versions need to be identified, for which
many schemes exist. Availability can be enhanced by using IDE RAID,
which seems to offer good performance at about twice the cost of
un-RAIDED disks (Chung,
Gray, Horst and Worthington).
There is also the question of how technologically aggressive to be. At
the most aggressive end, Oceanstore provides an end-to-end transparent
approach to most of our concerns. However, whether we can actually
deliver such a service to users in a practical fashion is an open
question. Also, moving from more reliable somewhat higher-end servers
to lower cost generic servers may impose an availability penalty we are
unwilling to pay. Therefore, we will not commit to particular
mechanisms here, but note that there are any number of schemes that will
likely prove suitable for our purposes, and may attempt to introduce
increasingly technologically aggressive ones over time. Below we
describe some of our research that we hope will inform and support the
infrastructure components we seek to develop here.
Note that, while each of these projects is limited by storage needs,
none of them is a storage infrastructure project, so no individual
project can afford the effort needed to set up and maintain a
disk-to-disk backup system, etc. Just as Millennium was an efficient
way of providing low-cost cluster computing to a large number of
projects, it is only feasible to try to deliver low-cost storage if we
aggregate enough needs to approach the problem at a large enough scale.
Thus, we propose to construct the storage elements needed to support
this evolving view of universal, high-capacity, low-overhead storage
within the context of UC Berkeley's CITRIS environment. CITRIS
is a useful environment because it generates a wide variety of storage
needs for a broach range of research projects, some of which are
CISE-related. Thus, this proposed facility will both let us address
specific needs of a large group of CISE researchers that cannot
otherwise we addressed by individual research grants, but will also
provide a testbed for creating the storage infrastructure of the future.
Smart distributed storage nodes
As suggested above, it is now feasible to provide as a networked service
significant storage based on inexpensive commodity parts. Current (and
improving) disk-drive technology makes terabyte mini-servers feasible
for ~$4,000. One can distribute a number of such servers at appropriate
points in the network, and configure them with a range of increasing
technologically aggressive properties. At the most conservative end,
these systems run as more or less independent, if cooperating, entities.
At the more aggressive end, the nodes act as a coherent storage system,
such as that of Oceanstore. We plan to
begin with a more conservative architecture, and attempt to roll out
more technologically aggressive services as the technology matures.
Fundamental to using such nodes is the model of decentralized,
peer-to-peer storage that has recently emerged as viable for scalable,
highly-available, and highly-durable storage. Peer -to-peer storage
systems assume that physical components (e.g. servers, disks, and
network connections) are both unreliable (i.e., can fail at any time)
and untrusted (i.e., potentially compromised and under control of
adversarial forces), and rely instead on the behavior of multiple,
redundant elements working together. Peer-to-peer architects exploit
this redundancy, along with a variety of cryptographic techniques, to
achieve guarantees that are normally expected of centrally managed
storage infrastructures. Indeed, the process of assuming that physical
components are fallible has been an extremely productive one for system
architects. The resulting systems promise to achieve globally-available
storage that is far more stable, durable, secure, and easy to manage
than any existing solutions.
A peer-to-peer approach uses a large number of distributed storage
nodes, and allows attractive aggregate properties to emerge from these
via using a number of techniques, three of which we plan to exploit
here: location-independent routing, versioning, and Byzantine Agreement.
Location-Independent Routing: One of the most important enabling
technologies for distributed storage is a data location layer. Such a
layer may be viewed as simply a directory service, able to track the
locations of every replica of every piece of information. The presence
of such a layer greatly increases availability-, as when one server
crashes, the location layer is able to direct queries for data to backup
servers. The data location layer can itself be distributed across many
nodes in the system for fault tolerance.
Distributed systems capable of locating information are sometimes
described as Decentralized Object Location and Routing systems (DOLRs).
DOLR systems provide a network abstraction of routing directly to
objects by name, thus supporting fault-tolerant access of information
independent of its location. Indeed, several of these systems,
including Tapestry, developed at Berkeley, continue to operate correctly
even as components fail or come under attack. Moreover, data can be
moved, repaired, and copied without explicitly informing the consumers
of the data, thus greatly enhancing the potential for automatic
management of information.
Versioning: It is plausible to consider keeping every version of every object indefinitely on disk. We call this data model "versioning". With versioning, each update of a data object generates a new version.
Versioning has several consequences. First, each version of an object
now becomes read-only and can be uniquely identified via cryptographic
hashes, such as SHA-1. With a sufficiently large hash value (SHA-1
produces 160 bits), collisions will never occur, and as a result, each
version can be uniquely identified forever by its hash value, which can
be used as the version's Globally Unique IDentifier (GUID).
Second, the read-only nature of each version greatly simplifies
archiving. Read-only data can be stored in efficient formats such as
via erasure codes and can be actively repaired, i.e., automatic
processes can scan through the data on a regular basis and filter out
bad copies by checking each recovered piece of information against its
GUID. The result is fundamentally more powerful than traditional
tertiary storage infrastructures utilizing archival media such as tape.
Since information is continuously repaired and moved from physical media
to physical media, it becomes independent of media degradation and
obsolescence.
Third, newer versions of an object can contain references to previous
versions by including GUIDs of these previous versions. This fact
permits efficient support of versioning via "copy on write". It also
permits the ability to "time travel" or view the state of a data object
at an arbitrary point in the past. Time travel obviates the need for
traditional backup infrastructures and permits easy recovery from faults
and human error. As an additional consequence of versioning, cached
copies can be kept coherent by distributing the GUID of the latest
version to each cached.
Byzantine Agreement: Byzantine Agreement allows a set of peers to come
to a unified decision about something, even if some of them (less than
1/3) are actively attempting to compromise the process. Should the
correct number of nodes agree, the result can be signed in aggregate
with threshold signatures to permit others to verify the result of the
decision at a later date. Byzantine agreement is an ideal technology to
embed in the center of an archival system to provide write access
control and guaranteed levels of replication.
The above three technologies are beginning to appear in a number of storage infrastructures including OceanStore (Berkeley) and Farsite (Microsoft). When combined with fast caching of live data (see below), these three technologies can yield fast, secure, and durable storage. One of these techniques is that a peer-to-peer storage infrastructure should consist of many storage servers with large amounts of storage, sufficient computational power to perform cryptographic operations at network speeds, and sufficient network bandwidth to serve as an active DOLR router. This is what we call a smart storage node.
Distributed Storage: The Initial Step
As a first step to a distributed storage infrastructure, we plan to
distribute a series of independent storage nodes to geographically
distinct sites around the Berkeley campus and at other CITRIS-affiliated
(particularly UC Davis). We will build a simple indirection directory
service as a small number of "gateway" machines that are responsible for
tracking the locations of replicas of data. Such a degenerate version
of a DOLR is easily implemented in the form of LDAP servers. These
servers will monitor the state of the storage servers and the data
stored on these servers (at a very course grain, such as directories or
file systems). When a storage server crashes, these directory servers
will be capable of suggesting which server is available for reading the
requested copy. We will make the LDAP servers highly-available through
clustering or other established technology.
In this manner, we can provide archival disk storage by building a very
simple storage layer that will indirect through the directory servers to
access information on the distributed storage servers. The storage
servers themselves will utilize commodity storage technology in the form
of file servers exporting NFS or NTFS. Basic versioning in this system
will consist of nightly incremental backups to the distributed storage
servers. In this incarnation, we will cache complete copies of files on
storage servers and only push updates to files that have changed.
Archival storage will be regarded as read-only, with every version
uniquely named with a hash value (GUID). Thus, a storage server will
know if it has the requested GUID, and will never have to worry about
having an out-of-date version. A background process will seek to
migrate new versions to servers that unavailable when versions are
written.
The mapping between file names and GUIDs will be secured on
highly-available storage (such as, for instance, the LDAP servers).
There are a number of relevant techniques that may be utilized here.
For example, one archival
storage system we have constructed generates read-only archival
versions of hierarchies of directories. In this case, the only mapping
required to discover the latest copy of a directory hierarchy was a
mapping from the top-level directory to the GUID of its latest version.
Distributed Storage: A More Advanced Model
A more ambitious goal is to construct a much more distributed system.
We do not think it is realistic to provide such a production system
during the lifetime of this grant, but plan to work toward it
incrementally as an ultimate goal.
We plan to design such a system by building a "wide area" Tapestry
starting with the Tapestry DOLR and utilizing the >a
href="http://www.planet-lab.org/">PlanetLab infrastructure.
PlanetLab is a global overlay network, currently spanning 100 sites and
eventually growing to thousands, and exhibiting a rich diversity of link
technologies, including some sites located at major routing and
co-location centers of the Internet. While most of these sites have
modest resources, several will exhibit substantial processing and
storage resources-including the site supported by this experimental
infrastructure proposal if funded. Preliminary versions of Tapestry
already run on PlanetLab.
On top of Tapestry, we will place a unified storage system that supports
incremental versioning and efficient archival storage through erasure
coding. At this stage, we will place a larger number of storage servers
throughout the network, thereby supporting the independent storage of
archival data fragments, in which only parts of files that have changed
are actually pushed out to the archive. Techniques can be
used to recognize which sub-blocks of files have changed.
At this stage of development, client-side software to support our
archival storage interfaces will be desirable. Once such interfaces
exist (for instance, User-level file system support through custom NFS
daemons), we can envision a smooth integration between the archival
components and the fast data cache (described in below). Here
information that was in active use could be transparently cached while
inactive data could reside only in remote storage servers. Also at this
stage, we can begin to experiment with active repair techniques that
monitor storage servers and, based on erasure-coded storage, reconstruct
information that has been partially lost on crashed servers.
Distributed Storage: Cross-Domain Cooperative Storage
At this point, it becomes possible to consider adding some architectural
elements from the SAHARA
"storage wide-area networking" project. This project aims to be able to
create end-to-end network services with desirable and predictable
properties, like performance and availability, when provisioned from
multiple independent service providers. To do so, SAHARA is developing
an architecture for future network services that supports the dynamic
confederation of sometimes collaborating and sometimes competing service
providers. Such an architecture would be helpful in the event that
individual storage servers are owned by more than one organization, but
combined as a whole to provide stable archival storage. It may be
possible to experiment with such organizations first within CITRIS, and
then, more broadly, through PlanetLab. As a basic first step, all
information would be encrypted for privacy and authenticated through
cryptographic techniques such as signatures and secure hashing. More
interestingly, SAHARA and Tapestry could interact in productive ways,
for example, by SAHARA providing Tapestry with suggestions for routing
based in its understanding of networking properties. We will
incorporate pieces of this vision as they stabilize and as time permits.
Also to be explored at this phase would be techniques for committing
information in the wide area through Byzantine agreement protocols. In
particular, we would explore the advantages of a hierarchy of trust:
Some servers would be more trusted to perform commitment of information
to the archive; however, even such trusted servers would be subject to
compromise, thereby dictating the need for Byzantine agreement
protocols. (Prototype Byzantine agreement protocols are already
functioning within the OceanStore prototype.)
These steps, if successful, provide for the possibility of constructing
a "survivable wide-area storage network", in which a disaster at some
number of otherwise unrelated sites will be recoverable from the still
functioning sites. In particular, there is a disaster recovery
application of the SAHARA project, which is described further in the
applications section below.
|