Key Applications
Digital Library Applications
Berkeley has a number of "digital library" efforts, the UC Berkeley Digital Library Project
perhaps being most prominent among them. This project has a number of
storage needs that can be quite considerable indeed, some of which we
highlight here.
One key aspect of the DLP is an emphasis on access to items other than
traditional text documents. We have previously developed useful
content-based image access technology and, more recently, have been
experimenting with combining text and images. Some of our collections
are large by current
on-line image collection standards, but still rather modest. For
example, we house the 80,000 image collection from the San Francisco
Fine Arts Museum on our server, and use this, together with another
80,000 images, as the basis for various experimental efforts. However,
we would like to operate on much larger image collections, such as all
the images available on the web. For example, one experiment we are
attempting to conduct involves accessing all the photo-illustrated news
stories available on the web, and using combined text-image methods to
analyze these. For such experiments, we are limited by computing
capabilities, which the Millennium project helps us address, and by
storage considerations, which we propose to address here.
Another current DLP effort is to create a database of "lexical signatures"
of all the pages of the web. A lexical signature is a subset of
document content which has the property of more or less uniquely
identifying that document to a search engine, and which is somewhat
robust to document change. Interestingly, lexical signatures of 5-10
words appear to be sufficient to do the job adequately. Lexical
signatures can be attached to URLs to make robust
hyperlinks, which can be dereferenced when the URL fails by
simply submitting the signature to a search engine. In practice,
though, most URLs will not be "signed". Instead, we are building a
database from a web crawl that will provide lexical signatures for
"lost" web pages retroactively. Such a database will be much smaller
than a web archive, but will grow with the number of pages present on
the web. Hence, the bottleneck to providing a public "lost and found"
service for looking up missing web pages is mostly a matter of storage.
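As a rough illustration of what a lexical signature might look like in code, the sketch below picks the few terms of a page that are frequent in that page but rare across a reference corpus (a TF-IDF ranking). This selection heuristic is an assumption made for illustration, not necessarily the method used by the DLP, and the names `tokenize` and `lexical_signature` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `corpus` is a list of document strings (it should include `doc`)
    used to estimate how rare each term is. Illustrative heuristic only.
    """
    doc_terms = Counter(tokenize(doc))

    # Document frequency: in how many corpus documents each term appears.
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    n_docs = len(corpus)

    def score(term):
        # Smoothed inverse document frequency; rarer terms score higher.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term]))
        return doc_terms[term] * idf

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(doc_terms, key=lambda t: (-score(t), t))
    return ranked[:k]


corpus = [
    "the muon detector records cherenkov photons in polar ice",
    "the freeway cameras record traffic video near the city",
    "the library stores scanned documents for the course",
]
print(lexical_signature(corpus[0], corpus, k=3))
```

A word such as "the", which occurs in every document, receives a score of zero and never enters the signature, which is the property that lets a short signature single out one page.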
Finally, there is a DLP effort to support so-called "Personal
Libraries". The Personal Library (PL) is a set of services that allows
users to store, organize and maintain network-accessible distributed
document resources. Examples of such resources are on-line readers for
courses, collections of documents related to some topic, collections of
personal documents, technical report series, and collections of
documents maintained by a working group.
The primary services provided by the PL are a collection management
service and a repository service. Here, a collection is a set of
documents, each of which may in turn comprise one or more (potentially
distributed) resources (in general, a fixation of a document in a given
format). The collection manager lets users create, populate, maintain
and search collections, among other things. Of course, the resources
comprising documents in a collection have to live somewhere. If a
resource already has a satisfactory network-accessible home, a
collection may just point to that. However, if it doesn't, as may be the
case for a paper document in a filing cabinet or an electronic document
on a local disk, the repository server affiliated with our collection
manager will provide storage for it. Since the repository service houses
documents, and the collection manager caches remotely housed documents,
the storage demands grow with collection size. In addition, the PL
includes a "scan-to-collection" service, which allows users to create a
cover page for a paper document, place that page on top of the
document, place the stack on a publicly available scanner, and press a
button, whereupon the document will shortly end up in the named
collection, stored in the affiliated repository, in a format suitable
for viewing. So, it is quite easy to populate the repository with large
scanned document images. We have recently made the PL available to our
department members, on an experimental basis. Interestingly, students
and faculty members continue to find new and unanticipated uses for the
services. (This year, the EECS graduate admissions office plans to use
it to completely automate the graduate student admission review process
by scanning all the paper transcripts supplied.) We would like to
encourage such creative uses. About the only cost to the project of
doing so is the associated, unanticipated storage.
Metropolitan Area Freeway Traffic Estimation, Prediction and Control
Researchers at UC Berkeley have been working for some time on problems
in sensing, state estimation, prediction and control in the context of
a prototypical metropolitan-area freeway transportation system. Such a
system is monitored by
loop detectors, video cameras, electronic vehicle tags, and cellular
phones. However, in the absence of computational technology to estimate
system state and to predict system behavior, both transportation
managers and travelers make decisions as if they were blindfolded.
Video data poses the severest algorithmic challenges but also offers
considerable benefits by allowing the tracking of ordinary, unmarked
vehicles over long distances. The UC Berkeley vision group has
developed key algorithms in this area with considerable success in field
trials. They have set up a test facility where multiple cameras mounted
on top of a multistory building are used to monitor both directions of
traffic on a 3-mile segment of freeway I-80 near Berkeley, and have
recorded an enormous volume of data: literally thousands of videotapes,
comprising about 1000 hours of recordings. The current protocol for
using these data is to digitize small segments of them and make those
available online for research. By doing so, the research group has been
able to address some problems, such as estimating models of drivers'
lane-following behavior.
Unfortunately, research in this area is currently impeded because the
project has no means of storing these data online. For example,
research on problems such as incident detection is difficult to carry
out without on-line storage. Incidents correspond to a particular
category of system states or trajectories, e.g., an "injury accident".
Thus, one must be able to detect and classify incidents with sufficient
reliability. The challenge is to learn characteristic signatures from
historical incident databases, to be able to recognize them in real
time, a very interesting and practical example of data mining. Since
incidents are (fortunately) rare, such a determination requires that all
the data be available online.
One hour of uncompressed color video is on the order of 100 GB, so
storing a thousand hours requires ~100 terabytes. Indeed, the storage
needs are arguably much greater than this, as collecting traffic data has
been temporarily suspended, due to the fact that more data cannot
currently be profitably exploited. Thus, this project represents a
storage challenge that the current proposal will endeavor to address.
Indeed, since the data are already "backed up" on tape, we are sure that
the approach to be followed here can provide the required storage
at low cost.
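The figures above can be checked with a back-of-the-envelope calculation. The sketch below assumes 640x480 frames, 24-bit color, and 30 frames per second. These parameters are illustrative assumptions chosen to match the order of magnitude quoted in the text; the actual recording format is not stated here.

```python
GB = 10**9
TB = 10**12


def uncompressed_video_bytes(hours, width=640, height=480,
                             bytes_per_pixel=3, fps=30):
    """Bytes needed to store `hours` of uncompressed video.

    The default frame size, color depth, and frame rate are assumed
    values for illustration only.
    """
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps * 3600 * hours


one_hour = uncompressed_video_bytes(1)
archive = uncompressed_video_bytes(1000)
print(f"one hour:   {one_hour / GB:.1f} GB")
print(f"1000 hours: {archive / TB:.1f} TB")
```

Under these assumptions one hour comes to roughly 99.5 GB, and a thousand hours to roughly 99.5 TB, consistent with the estimates of 100 GB and ~100 terabytes.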
Millennium
Here we highlight a number of Millennium projects whose
transient storage requirements are particularly acute, and which
underscore the necessity of the disk cache component of this proposal.
While highlighting these applications dramatizes our needs, and provides
specific examples of research advances that will be enabled by an
adequate disk cache, it does not accurately reflect the full extent of
the need, or of the upside of the proposed facility. This is because
there is a much larger number of applications with smaller needs which
need to be met simultaneously, but cannot be with the current Millennium
0.5 TB disk cache, which, as we noted, is currently 97% full, despite a
rather draconian garbage collection policy. In addition, new Millennium
applications arise daily, and thus, based on our experience, we
anticipate many more projects with large transient storage requirements
to arise.
Image Segmentation
For the past 12 months, the Berkeley Vision Group has been working on
the problem of finding boundaries between objects in natural images. A
dataset of ~1000 images, which have been segmented by human subjects, is
processed by
the Millennium cluster in order to explore the parameter spaces of
various boundary detection algorithms. These algorithms can then be
optimized using the computed information. In addition, the different
algorithms can each be compared to the human ground truth.
The enormous computing resources available from the Millennium cluster have permitted this group to do research that simply could not be done otherwise. However, for this group's needs, by far the weakest link in the Millennium cluster (and perhaps in clusters in general) is the lack of a properly scaled durable storage layer. This is so because, while the size of the primary dataset of the group's calculations is not large, the intermediate data generated can be enormous. Evaluating a single algorithm at a single parameter setting requires generating a file that contains up to 30 features at each pixel in each image. This file is on the order of 1 GB in size. Experiments routinely require such computations for thousands of parameter settings on dozens of algorithms.
For the sake of modularity, the group's tools communicate these huge data sets through the file system. The existing 1/2 TB shared file system on Millennium, while invaluable for this work, is often the bottleneck of their computation, especially when other users are accessing the file system.
In the near future, the group will be working on other computer vision problems of similar structure with even larger storage needs. They have work planned in the next couple of months for image segmentation, where features are computed for pixel pairs instead of for individual pixels. In addition, they are beginning work on video, where both long-term and short-term storage needs are vastly increased. Thus, a large capacity cluster-wide file system, plus a large capacity object store, are imperative for future progress in this area.
Antarctic Muon and Neutrino Detector Array (AMANDA)
AMANDA is a detector being constructed at the South Pole, whose
purpose is to observe high-energy
neutrinos from astrophysical point sources. Strings of widely spaced
photomultiplier tubes (PMTs) are placed into deep water-drilled holes in
the South Polar ice cap. High energy neutrinos coming up through the
earth will occasionally interact with ice or rock and create a muon;
such a muon emits Cherenkov light when passing through the array, and it
can be tracked by measuring the arrival times of these Cherenkov photons
at the PMTs.
The Millennium Cluster has been used by the AMANDA group for
several different computations. These include parameterizing the Java
code that propagates muons through media; optimizing air shower
propagation code; running dCORSIKA through a Monte Carlo chain,
including muon propagation and detector simulation; and using downgoing
muons in the large AMANDA datasets to calibrate the detector's geometry
and determine shifts due to ice flow.
Since several of the programs are Monte Carlo based, it is quite easy to
break the big run into many smaller ones (usually one per node). Each
smaller run is started with its own separate random generator seed. The
programs create a large amount of intermediate data, which is then
"cleaned" to reduce its size on the current Millennium architecture.
The intermediate files (e.g., those created after a muon propagation
step, which creates a lot of secondary point-like showers) can be quite
large, often 1 GB per node. Thus, a large Millennium run essentially
uses up all the disk cache we have (or, more likely, cannot be attempted
at all while any other applications are running). Once again, a large
capacity cluster-wide file system is crucial for computations of this
nature.
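The splitting scheme described above, one sub-run per node with its own random generator seed, can be sketched as follows. This is a toy stand-in: it estimates pi rather than propagating muons, it runs the sub-runs in a loop rather than on cluster nodes, and the function names and the seed-derivation rule (`base_seed + i`) are assumptions made for illustration.

```python
import random


def sub_run(seed, n_samples):
    """One per-node run: count random points that fall in the unit quarter circle."""
    rng = random.Random(seed)  # private generator, so runs share no state
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def split_run(total_samples, n_nodes, base_seed=12345):
    """Split one large run into n_nodes sub-runs, each with a distinct seed."""
    per_node = total_samples // n_nodes
    seeds = [base_seed + i for i in range(n_nodes)]  # one seed per node
    # On a cluster each call below would execute on a different node.
    hits = [sub_run(seed, per_node) for seed in seeds]
    return 4.0 * sum(hits) / (per_node * n_nodes)


print(split_run(total_samples=200_000, n_nodes=8))
```

Because every sub-run depends only on its seed and sample count, the same seeds reproduce the same result, and only the small per-node hit counts (not the intermediate samples) need to be combined at the end.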
This project also has significant persistent object store needs, which we seek to address via the object store component of this project. Specifically, it is useful to store several years' worth of resulting data, which would require between 5 and 10 TB of object storage.
BErkeley Aerial Robot (BEAR)
The BEAR project is involved in designing optimal and collision-free
trajectories for multiple autonomous aerial vehicles, computing optimal
controller parameters for large-scale systems, and identifying system
dynamics from large data sets. The group uses the Millennium cluster for
linear/nonlinear programming to optimize high-dimensional parameters
and for Monte Carlo simulations of linear/nonlinear stochastic systems
to test the performance of trajectories and controllers.
Given the high demand and limited resources of the current Millennium
storage capacity, they are unable to process some of their larger
datasets. They are also not able to keep the data around in raw format
for post-processing and debugging purposes, and future access.
In conjunction with the NEST (Network Embedded Systems Technology)
project, this work has taken on a
completely new element, where the aerial and ground-based robots are
augmented with the ability to deposit fields of sensor nodes and
interact with them over the network. This is introducing fine-grain
distributed control into the regime of tiny wireless nodes. Complete
empirical traces are obtained and used to develop algorithms that may
eventually be deployed over the fine-grained network.
Smart Buildings
Ivy is a test environment whose principal goal is to provide a sensor
network research infrastructure of fixed and mobile motes. Ivy is being
implemented in a number of buildings on and off campus, in which a
number of building-related experiments will be carried out. Our current
list of these projects is as follows:
- Energy efficient building operation: Experiments on the effective use of
sensing motes to monitor the detailed operation of buildings, and of
environmental control approaches using actuator motes added to or
supplanting the building control systems. This project uses sensors
for temperature, humidity, air velocity, CO2, light, acoustics, power,
occupancy, and window switches, as well as actuators for electric power
relays, status-indicating motes, and audible and visible signals.
- Demand responsive electricity management systems: Experiments with smart
thermostats and electric meters to determine usability and
effectiveness. This project uses sensors for watts (whole-house and
local), temperature, and occupancy; actuators are as above. It also
entails the use of wireless devices for distributed computation.
- Fire safety, disaster preparedness: Experiments with systems of sensors
indicating the levels of smoke, noxious gases, temperature, and
occupancy in buildings during simulated emergencies. The sensor
information is processed and relayed to firefighters and emergency
personnel in the building, along with location coordinates.
- Structural integrity: Dense arrays of accelerometers attached to
significant structural components are being deployed to detect changes
in a building's structural characteristics following earthquakes,
blasts, or other types of structural damage.
The Ivy test environment needs to be created at a scale that can test
realistic usage, which in buildings will represent quite extensive
networks of motes. Large mote numbers would be needed to determine
whether UCB's (unique self-organizing) wireless network technology has
to be extended or modified for it to reliably meet the requirements of
our building applications.
For energy and structural applications, we need to archive detailed
building data over long (seasonal or annual) time periods in order to
evaluate the buildings' performance. In addition, we need to archive
data transfer information to determine the efficiency/reliability of the
Ivy network itself, as a large number of sensors and transmitters are
operated over long periods of time.
Datacenter Disaster Response
The SAHARA project plans to develop a datacenter disaster response
application, which is defined as
the low latency establishment of high speed connectivity to facilitate
the rapid copying of huge data volumes to a remote (set of) site(s). In
addition to the obvious requirements for high bandwidth and low latency
from the underlying network, such an application demands the rapid
identification of locations where storage resources are to be found among
storage service providers in the wide-area network, and an understanding
of the geographical and topological diversity of those resources to give
both a high confidence that the storage provider(s) will avoid the
disaster and that non-interfering paths can be found to exploit
parallelism in the data copying operation.
This project plans to define the disaster response application in more
detail, and to design and prototype the relevant underlying application
and network services as proofs of concept. These will span enhanced
connectivity (e.g., fast methods to identify parallel and orthogonal
network paths between the client site and the storage service instances
in the network) and resource management techniques (e.g., selection of
candidate storage service providers based not only on storage
availability but also on dynamically determined end-to-end network
bandwidth).
The plan is to investigate how active network components can support
such applications. For example, the pre-existing trust relationships
between client organizations and a particular storage service provider
might be such that the latter would not be the best choice to receive
the disaster copy given current bandwidth and latency considerations.
Thus an alternative provider might be selected if the data to be stored
can be encrypted on the fly using local processing resources. Other
examples include automatically striping copy flows across multiple
storage service provider instances to reduce overall latency, and the
controlled introduction of redundancy, such as RAID-style parity, to such
wide-area copies. The proposed storage infrastructure will provide a
wide-area storage system over which such an application can be
constructed and tested.
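To make the striping and parity idea concrete, the sketch below splits a byte string into equal-length stripes, one per storage provider, and adds a single XOR parity stripe so that any one lost stripe can be rebuilt from the survivors. It is a minimal in-memory illustration under assumed names (`stripe_with_parity`, `rebuild_stripe`); it does not model networks, providers, or encryption, and it is not the SAHARA design.

```python
def stripe_with_parity(data, n_stripes):
    """Split `data` into n_stripes equal-length stripes plus one XOR parity stripe."""
    stripe_len = -(-len(data) // n_stripes)  # ceiling division
    padded = data.ljust(stripe_len * n_stripes, b"\0")  # zero-pad the tail
    stripes = [padded[i * stripe_len:(i + 1) * stripe_len]
               for i in range(n_stripes)]
    parity = bytearray(stripe_len)
    for stripe in stripes:
        for j, byte in enumerate(stripe):
            parity[j] ^= byte
    return stripes, bytes(parity)


def rebuild_stripe(stripes, parity, lost_index):
    """Recover the stripe at lost_index by XOR-ing the parity with the survivors."""
    rebuilt = bytearray(parity)
    for i, stripe in enumerate(stripes):
        if i == lost_index:
            continue  # this stripe is unavailable
        for j, byte in enumerate(stripe):
            rebuilt[j] ^= byte
    return bytes(rebuilt)


data = b"disaster copy of datacenter state"
stripes, parity = stripe_with_parity(data, 4)

# Simulate one provider becoming unreachable after the copy.
damaged = list(stripes)
damaged[2] = None
recovered = rebuild_stripe(damaged, parity, lost_index=2)
print(recovered == stripes[2])
```

With one parity stripe the copy tolerates the loss of any single provider at the cost of one extra stripe of storage and bandwidth; tolerating more simultaneous losses would require a stronger code than plain XOR.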
The Electronic Cultural Atlas Initiative
The Electronic Cultural Atlas Initiative (ECAI) is a global collaboration
supporting projects that combine global mapping, imagery, and texts.
ECAI provides scholars and other users with a digital research resource
that supports the presentation of complex combinations of data from
multiple disciplines visually and immediately. ECAI has nearly 800
affiliates and over 120 projects around the world; they are currently
holding their 12th annual meeting in Osaka.
ECAI has already proven to be an invaluable resource to humanities
researchers. For example, consider the ECAI Silk Road Atlas, prominent
among the many ECAI projects. The Silk Road is a famous network of
transportation routes, but it is also a means of understanding and
illustrating the ways that commodities, empires, religions, and the arts
have traveled throughout Eurasia for thousands of years. Understanding
the historical significance of the Silk Road leads one to appreciate
that there is not a divide between "West" and "East" so much as an
ongoing historical exchange of human experience.
The ECAI Silk Road Atlas is a series of interactive maps that make this
point. Users can see change over time, link from places on the map to
associated web resources, and explore maps of different scale, from a
single temple to the entire earth. Via the use of technology, the true
significance of the Silk Road is vividly rendered to scholars and
students alike.
ECAI is an ambitious effort, intending as it does to provide access to all the world's geo-referencable cultural material.
ECAI members link their data using a metadata clearing house at the
University of California at Berkeley. Therefore, it is at this point
that substantial leverage can be gained from an injection of resources.
For example, as one part of this venture, ECAI and the California
Digital Library (CDL) are providing publication of internet material for
multimedia datasets. This involves two parts:
- A fully peer reviewed finished publication for which the CDL
promises to preserve the raw data over time.
- A series featuring "works in progress" which puts up on the net a
collection of projects that are being created by the 800 affiliates.
This will be in the form of a report of activities, samples of data
being reviewed and examples of interactive use of the data.
At the moment, ECAI does not have space to handle such "works in
progress". Not surprisingly, the amount of storage required is sizable,
since these works encompass large numbers of scanned images, map spaces,
and interactive components. In particular, many researchers
contributing to these projects, especially in Africa, Central Asia, and
Latin America, are particularly resource-limited. Also, since the work is
so highly collaborative, it is hard to justify supporting its individual
components, yet scholars worldwide benefit from these efforts, making it
a prime beneficiary of low-cost storage services.
Similarly, ECAI would like to establish a mirror site for large data sets whose access is currently hampered because their host sites are poorly connected to the network. An initial version of such a mirror would require hundreds of gigabytes. The demand for cycles for mirroring can be quite significant, as any number of sites registered in the ECAI metadata Clearinghouse are large and threatening to grow. For example, mirroring the Arts and Humanities Data Service would require terabytes of storage; the Academia Sinica is discussing a petabyte store for the National Archives.
In summary, with more storage, ECAI can better serve a large international community, can give aid to those affiliates who lack server space at home, and can mirror sites for more rapid access by the US and Internet II communities.