Key Applications

Digital Library Applications

Berkeley has a number of "digital library" efforts, the UC Berkeley Digital Library Project perhaps being most prominent among them. This project has a number of storage needs that can be quite considerable indeed, some of which we highlight here.

One key aspect of the DLP is an emphasis on access to items other than traditional text documents. Previously, we developed some useful content-based image access technology and, more recently, have been experimenting with combining text and images. Some of these collections are large by current on-line image collection standards, but still rather modest. For example, we house the 80,000 image collection from the San Francisco Fine Arts Museum on our server, and use this, together with another 80,000 images, as the basis for various experimental efforts. However, we would like to operate on much larger image collections, such as all the images available on the web. For example, one experiment we are attempting to conduct involves accessing all the photo-illustrated news stories available on the web, and using combined text-image methods to analyze these. For such experiments, we are limited by computing capabilities, which the Millennium project helps us address, and by storage considerations, which we propose to address here.

Another current DLP effort is to create a database of "lexical signatures" of all the pages of the web. A lexical signature is a subset of document content which has the property of more or less uniquely identifying that document to a search engine, and which is somewhat robust to document change. Interestingly, lexical signatures of 5-10 words appear to be sufficient to do the job adequately. Lexical signatures can be attached to URLs to make robust hyperlinks, which can be dereferenced when the URL fails by simply submitting the signature to a search engine. In practice, though, most URLs will not be "signed". Instead, we are building a database from a web crawl that will provide lexical signatures for "lost" web pages retroactively. Such a database will be much smaller than a web archive, but will grow with the number of pages present on the web. Hence, the bottleneck to providing a public "lost and found" service for looking up missing web pages is mostly a matter of storage.
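
As an illustration only, the following Python sketch shows one plausible way to pick a short signature: rank a page's words by a TF-IDF-style weight and keep the top few. The weighting scheme, the function name lexical_signature, and the document-frequency table are assumptions made for this example; the description above does not specify how signatures are actually selected.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, num_docs, k=5):
    """Return the k terms of `text` with the highest TF-IDF-style weight."""
    words = re.findall(r"[a-z]+", text.lower())
    tf = Counter(words)

    def weight(term):
        # Assumption: terms never seen in the crawl are treated as very rare.
        df = doc_freq.get(term, 1)
        return tf[term] * math.log(num_docs / df)

    # Sort by weight (descending), then alphabetically so ties are stable.
    ranked = sorted(tf, key=lambda t: (-weight(t), t))
    return ranked[:k]


# Hypothetical document frequencies from a crawl of 1,000,000 pages.
doc_freq = {"the": 900_000, "of": 850_000, "image": 40_000,
            "lexical": 500, "signature": 2_000, "hyperlinks": 1_500}
page = "The lexical signature of the image: robust hyperlinks, lexical signature"
signature = lexical_signature(page, doc_freq, num_docs=1_000_000, k=3)
```

Common words such as "the" receive a weight near zero, so the signature is dominated by rare terms, which is what makes it likely to single out one page in a search engine. The per-page record is only a handful of words, which is consistent with the database being much smaller than a full web archive.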

Finally, there is a DLP effort to support so-called "Personal Libraries". The Personal Library (PL) is a set of services that allows users to store, organize and maintain network-accessible distributed document resources. Examples of such resources are on-line readers for courses, collections of documents related to some topic, collections of personal documents, technical report series, and collections of documents maintained by a working group.

The primary services provided by the PL are a collection management service and a repository service. Here, a collection is a set of documents, each of which may in turn comprise one or more (potentially distributed) resources (a resource being, in general, a fixation of a document in a given format). The collection manager lets users create, populate, maintain and search collections, among other things. Of course, the resources comprising documents in a collection have to live somewhere. If a resource already has a satisfactory network-accessible home, a collection may just point to that. However, if it doesn't, as may be the case for a paper document in a filing cabinet or an electronic document on a local disk, the repository server affiliated with our collection manager will provide storage for it. Since the repository service houses documents, and the collection manager caches remotely housed documents, the storage demands grow with collection size. In addition, the PL includes a "scan-to-collection" service, which allows users to create a cover page for a paper document, place that page on top of the paper document, place the stack on a publicly available scanner, and press a button, whereupon the document will shortly end up in the named collection, stored in the affiliated repository, in a format suitable for viewing. So, it is quite easy to populate the repository with large scanned document images. We have recently made the PL available to our department members, on an experimental basis. Interestingly, students and faculty members continue to find new and unanticipated uses for the services. (This year, the EECS graduate admissions office plans to use it to completely automate the graduate student admission review process by scanning all the paper transcripts supplied.) We would like to encourage such creative uses. About the only cost to the project of doing so is the associated unanticipated storage.

Metropolitan Area Freeway Traffic Estimation, Prediction And Control

Researchers at UC Berkeley have been working for some time on problems in sensing, state estimation, prediction and control in the context of a prototypical metropolitan area freeway transportation system. Such a system is monitored by loop detectors, video cameras, electronic vehicle tags, and cellular phones. However, in the absence of computational technology to estimate system state and to predict system behavior, both transportation managers and travelers make decisions as if they were blindfolded.

Video data pose the severest algorithmic challenges but also offer considerable benefits by allowing the tracking of ordinary, unmarked vehicles over long distances. The UC Berkeley vision group has developed key algorithms in this area with considerable success in field trials. They have set up a test facility where multiple cameras mounted on top of a multistory building are used to monitor both directions of traffic on a 3-mile segment of freeway I-80 near Berkeley, and have recorded an enormous volume of data: literally thousands of videotapes, comprising about 1000 hours of recordings. The current protocol for using these data is to digitize small segments of them, and make those available online for research. By doing so, the research group has been able to address some problems, such as estimating models of drivers' lane-following behavior.

Unfortunately, research in this area is currently impeded because the project has no means of storing these data online. For example, research on problems such as incident detection is difficult to carry out without on-line storage. Incidents correspond to a particular category of system states or trajectories, e.g., an "injury accident". Thus, one must be able to detect and classify incidents with sufficient reliability. The challenge is to learn characteristic signatures from historical incident databases, so as to be able to recognize them in real time, a very interesting and practical example of data mining. Since incidents are (fortunately) rare, such a determination requires that all the data be available online.

One hour of uncompressed color video is on the order of 100 GB, so storing a thousand hours requires ~100 terabytes. Indeed, the storage needs are arguably much greater than this, as the collection of traffic data has been temporarily suspended because more data cannot currently be profitably exploited. Thus, this project represents a storage challenge that the current proposal will endeavor to address. Indeed, since the data are already "backed up" on tape, we are sure that the approach to be followed here can provide the required storage at low cost.
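
The 100 GB per hour figure can be checked with a short back-of-the-envelope calculation. The Python sketch below assumes a capture format of 640x480 pixels, 24-bit color, and 30 frames per second; these parameters are illustrative assumptions and are not the project's stated recording format.

```python
# Assumed capture format (not stated in the text above):
# 640x480 pixels, 3 bytes per pixel (24-bit color), 30 frames per second.
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 30
HOURS_RECORDED = 1000  # from the text: about 1000 hours of recordings

bytes_per_hour = WIDTH * HEIGHT * BYTES_PER_PIXEL * FRAMES_PER_SECOND * 3600
gb_per_hour = bytes_per_hour / 1e9               # decimal gigabytes
archive_tb = gb_per_hour * HOURS_RECORDED / 1000  # gigabytes -> terabytes

print(f"{gb_per_hour:.1f} GB per hour, {archive_tb:.1f} TB for the archive")
```

Under these assumptions one hour comes to just under 100 GB, so the full archive lands at roughly 100 TB, in line with the estimate above.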

Millennium

Here we highlight a number of Millennium projects whose transient storage requirements are particularly acute, and which underscore the necessity of the disk cache component of this proposal. While highlighting these applications dramatizes our needs, and provides specific examples of research advances that will be enabled by an adequate disk cache, it does not accurately reflect the full extent of the need, or of the upside of the proposed facility. This is because there is a much larger number of applications with smaller needs which must be met simultaneously, but cannot be with the current Millennium 0.5 TB disk cache, which, as we noted, is currently 97% full, despite a rather draconian garbage collection policy. In addition, new Millennium applications arise daily, and thus, based on our experience, we anticipate many more projects with large transient storage requirements to arise.

Image Segmentation

For the past 12 months, the Berkeley Vision Group has been working on the problem of finding boundaries between objects in natural images. A dataset of ~1000 images, which have been segmented by human subjects, is processed by the Millennium cluster in order to explore the parameter spaces of various boundary detection algorithms. These algorithms can then be optimized using the computed information. In addition, the different algorithms can each be compared to the human ground truth.

The enormous computing resources available from the Millennium cluster have permitted this group to do research that simply could not be done otherwise. However, for this group's needs, by far the weakest link in the Millennium cluster (and perhaps in clusters in general) is the lack of a properly scaled durable storage layer. This is so because, while the size of the primary dataset of the group's calculations is not large, the intermediate data generated can be enormous. Evaluating a single algorithm at a single parameter setting requires generating a file that contains up to 30 features at each pixel in each image. This file is on the order of 1 GB in size. Experiments routinely require such computations for thousands of parameter settings on dozens of algorithms.

For the sake of modularity, the group's tools communicate these huge data sets through the file system. The existing 1/2 TB shared file system on Millennium, while invaluable for this work, is often the bottleneck of their computation, especially when other users are accessing the file system.
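
To give a sense of scale, the Python sketch below multiplies out the figures quoted above. The specific counts (1000 parameter settings, 12 algorithms) are assumed low-end readings of "thousands" and "dozens", chosen only for illustration.

```python
GB_PER_RUN = 1             # from the text: about 1 GB per algorithm and setting
PARAMETER_SETTINGS = 1000  # assumed: low end of "thousands"
ALGORITHMS = 12            # assumed: low end of "dozens"
SHARED_FS_TB = 0.5         # existing Millennium shared file system

total_tb = GB_PER_RUN * PARAMETER_SETTINGS * ALGORITHMS / 1000
ratio = total_tb / SHARED_FS_TB

print(f"{total_tb:.0f} TB of intermediate data, "
      f"{ratio:.0f}x the shared file system")
```

Even at these conservative counts, a full sweep would produce about 12 TB of intermediate files, roughly 24 times the capacity of the existing shared file system, so results must be generated and discarded in batches.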

In the near future, the group will be working on other computer vision problems of similar structure with even larger storage needs. They have work planned in the next couple of months for image segmentation, where features are computed for pixel pairs instead of for individual pixels. In addition, they are beginning work on video, where both long-term and short-term storage needs are vastly increased. Thus, a large capacity cluster-wide file system, plus a large capacity object store, are imperative for future progress in this area.

Antarctic Muon and Neutrino Detector Array (AMANDA)

AMANDA is a detector being constructed at the South Pole, whose purpose is to observe high-energy neutrinos from astrophysical point sources. Strings of widely spaced photomultiplier tubes (PMTs) are placed into deep water-drilled holes in the South Polar ice cap. High energy neutrinos coming up through the earth will occasionally interact with ice or rock and create a muon; such a muon emits Cherenkov light when passing through the array, and it can be tracked by measuring the arrival times of these Cherenkov photons at the PMTs.

The Millennium Cluster has been used by the AMANDA group for several different computations. These include: parameterizing Java code that propagates muons through media; optimizing air shower propagation code; running dCORSIKA through a Monte Carlo chain, including muon propagation and detector simulation; and using downgoing muons in the large AMANDA datasets to calibrate the detector's geometry and determine shifts due to ice flow.

Since several of the programs are Monte Carlo based, it is quite easy to break the big run into many smaller ones (usually one per node). Each smaller run is started with its own separate random generator seed. The programs create a large amount of intermediate data, which is then "cleaned" to reduce its size on the current Millennium architecture. The intermediate files (e.g., those created after a muon propagation step, which creates a lot of secondary point-like showers) can be quite large, often 1 GB per node. Thus, a large Millennium run essentially uses up all the disk cache we have (or, more likely, cannot be attempted at all while any other applications are running). Once again, a large capacity cluster-wide file system is crucial for computations of this nature.
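
The pattern of splitting a Monte Carlo run into independently seeded per-node chunks can be sketched as follows. This is a minimal Python illustration, not the AMANDA code: the functions split_run and run_chunk, the seed scheme (base seed plus node index), and the stand-in "physics" (each event is a hit with probability 0.25) are all assumptions made for the example.

```python
import random


def run_chunk(seed, num_events):
    """Simulate one per-node chunk and return the number of 'hits'.

    Stand-in physics: each event is a hit with probability 0.25.
    """
    rng = random.Random(seed)  # independent generator for this chunk
    return sum(rng.random() < 0.25 for _ in range(num_events))


def split_run(total_events, num_nodes, base_seed):
    """Divide total_events across nodes, giving each chunk its own seed."""
    per_node, extra = divmod(total_events, num_nodes)
    jobs = []
    for node in range(num_nodes):
        events = per_node + (1 if node < extra else 0)
        jobs.append((base_seed + node, events))
    return jobs


TOTAL_EVENTS = 100_000
jobs = split_run(TOTAL_EVENTS, num_nodes=8, base_seed=2024)
# On a cluster each job would run on a different node; here they run in turn.
hits = [run_chunk(seed, events) for seed, events in jobs]
estimate = sum(hits) / TOTAL_EVENTS
```

Because each chunk depends only on its own seed and event count, chunks can be rerun individually after a failure and the combined result is reproducible. In the real workload each chunk would also write its intermediate file to the shared disk cache, which is where the per-node storage demand described above arises.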

This project also has significant persistent object store needs, which we seek to address via the object store component of this project. Specifically, it is useful to store several years' worth of resulting data, which would require between 5 and 10 TB of object storage.

BErkeley Aerial Robot (BEAR)

The BEAR project is involved in designing optimal and collision-free trajectories for multiple autonomous aerial vehicles, computing optimal controller parameters for large scale systems, and identifying system dynamics from large sets of data. They use the Millennium cluster for linear/nonlinear programming to optimize large dimensional parameters and for Monte Carlo simulations of linear/nonlinear stochastic systems to test the performance of trajectories and controllers.

Given the high demand and limited resources of the current Millennium storage capacity, they are unable to process some of their larger datasets. They are also not able to keep the data around in raw format for post-processing and debugging purposes, and future access.

In conjunction with the NEST (Network Embedded Systems Technology) project, this work has taken on a completely new element, where the aerial and ground-based robots are augmented with the ability to deposit fields of sensor nodes and interact with them over the network. This is introducing fine-grain distributed control into the regime of tiny wireless nodes. Complete empirical traces are obtained and used to develop algorithms that may eventually be deployed over the fine-grained network.

Smart Buildings

Ivy is a test environment whose principal goal is to provide a sensor network research infrastructure of fixed and mobile motes. Ivy is implemented in a number of buildings on and off campus, in which a number of building-related experiments will be carried out. Our current list of these projects is as follows:

  • Energy efficient building operation: Experiments on the effective use of sensing motes to monitor the detailed operation of buildings, and of environmental control approaches using actuator motes added to or supplanting the building control systems. This project uses sensors for temperature, humidity, air velocity, CO2, light, acoustics, power, occupancy, and window switches, as well as actuators for electric power relays, status-indicating motes, and audible and visible signals.
  • Demand responsive electricity management systems: Experiments with smart thermostats and electric meters to determine usability and effectiveness. This project uses sensors for watts (whole-house and local), temperature, and occupancy; actuators are as above. It also entails the use of wireless devices for distributed computation.
  • Fire safety, disaster preparedness: Experiments with systems of sensors indicating the levels of smoke, noxious gases, temperature, and occupancy in buildings during simulated emergencies. The sensor information is processed and relayed to firefighters and emergency personnel in the building, along with location coordinates.
  • Structural integrity: Dense arrays of accelerometers attached to significant structural components are being deployed to detect changes in a building's structural characteristics following earthquakes, blasts, or other types of structural damage.

The Ivy test environment needs to be created at a scale that can test realistic usage, which in buildings will represent quite extensive networks of motes. Large mote numbers would be needed to determine whether UCB's (unique self-organizing) wireless network technology has to be extended or modified for it to reliably meet the requirements of our building applications. For energy and structural applications, we need to archive detailed building data over long (seasonal or annual) time periods in order to evaluate the buildings' performance. In addition, we need to archive data transfer information to determine the efficiency/reliability of the Ivy network itself, as a large number of sensors and transmitters are operated over long periods of time.

Datacenter Disaster Response

The SAHARA project plans to develop a datacenter disaster response application, which is defined as the low latency establishment of high speed connectivity to facilitate the rapid copying of huge data volumes to a remote (set of) site(s). In addition to the obvious requirements for high bandwidth and low latency from the underlying network, such an application demands the rapid identification of locations where storage resources are to be found among storage service providers in the wide-area network, and an understanding of the geographical and topological diversity of those resources, to give high confidence both that the storage provider(s) will avoid the disaster and that non-interfering paths can be found to exploit parallelism in the data copying operation.

This project plans to define the disaster response application in more detail, and to design and prototype the relevant underlying application and network services as proofs of concept. These will span enhanced connectivity (e.g., fast methods to identify parallel and orthogonal network paths between the client site and the storage service instances in the network) and resource management techniques (e.g., selection of candidate storage service providers based not only on storage availability but also on dynamically determined end-to-end network bandwidth).
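
As a rough illustration of the resource management idea, the Python sketch below filters candidate providers by available storage and then ranks them by measured end-to-end bandwidth. The Provider record, its field names, the select_providers function, and the sample figures are all hypothetical; the project's actual selection criteria are not specified here.

```python
from collections import namedtuple

# Hypothetical record describing one storage service provider.
Provider = namedtuple("Provider", ["name", "free_tb", "bandwidth_mbps"])


def select_providers(providers, needed_tb, count):
    """Keep providers with enough free space, then prefer higher bandwidth."""
    eligible = [p for p in providers if p.free_tb >= needed_tb]
    eligible.sort(key=lambda p: p.bandwidth_mbps, reverse=True)
    return eligible[:count]


# Made-up candidates; bandwidth would be measured dynamically in practice.
candidates = [
    Provider("site-a", free_tb=50, bandwidth_mbps=400),
    Provider("site-b", free_tb=5, bandwidth_mbps=900),
    Provider("site-c", free_tb=20, bandwidth_mbps=620),
]
chosen = select_providers(candidates, needed_tb=10, count=2)
chosen_names = [p.name for p in chosen]
```

Here the fastest provider (site-b) is excluded because it lacks space, which shows why storage availability and bandwidth have to be considered together. A fuller version would also weigh geographic and topological diversity, as discussed above.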

The plan is to investigate how active network components can support such applications. For example, the pre-existing trust relationships between client organizations and a particular storage service provider might be such that the latter would not be the best choice to receive the disaster copy given current bandwidth and latency considerations. Thus an alternative provider might be selected if the data to be stored can be encrypted on the fly using local processing resources. Other examples include automatically striping copy flows across multiple storage service provider instances to reduce overall latency, and the controlled introduction of redundancy, such as RAID-style parity, into such wide-area copies. The proposed storage infrastructure will provide a wide-area storage system over which such an application can be constructed and tested.
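
Striping with RAID-style parity can be sketched in a few lines. The Python example below is a minimal illustration under simplifying assumptions (a single XOR parity stripe, zero-padding of the last stripe, everything held in memory); the function names are hypothetical and it is not the design the project will necessarily adopt.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_with_parity(data, num_stripes):
    """Split data into equal-size stripes plus one XOR parity stripe."""
    stripe_len = -(-len(data) // num_stripes)  # ceiling division
    padded = data.ljust(stripe_len * num_stripes, b"\0")
    stripes = [padded[i * stripe_len:(i + 1) * stripe_len]
               for i in range(num_stripes)]
    parity = reduce(xor_bytes, stripes)
    return stripes, parity


def rebuild(stripes, parity, lost_index):
    """Recover one lost stripe by XOR-ing the parity with the survivors."""
    survivors = [s for i, s in enumerate(stripes) if i != lost_index]
    return reduce(xor_bytes, survivors, parity)


payload = b"datacenter disaster copy"
stripes, parity = stripe_with_parity(payload, num_stripes=4)
# Pretend the provider holding stripe 2 became unreachable.
recovered = rebuild(stripes, parity, lost_index=2)
```

Each stripe (and the parity) would be sent to a different provider over a separate path, so the copy proceeds in parallel and the loss of any single provider can be tolerated at the cost of one extra stripe of storage.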

The Electronic Cultural Atlas Initiative

The Electronic Cultural Atlas Initiative (ECAI) is a global collaboration supporting projects that combine global mapping, imagery, and texts. ECAI provides scholars and other users with a digital research resource that supports the presentation of complex combinations of data from multiple disciplines visually and immediately. ECAI has nearly 800 affiliates and over 120 projects around the world; they are currently holding their 12th annual meeting in Osaka.

ECAI has already proven to be an invaluable resource to humanities researchers. For example, consider the ECAI Silk Road Atlas, prominent among the many ECAI projects. The Silk Road is a famous network of transportation routes, but it is also a means of understanding and illustrating the ways that commodities, empires, religions, and the arts have traveled throughout Eurasia for thousands of years. Understanding the historical significance of the Silk Road leads one to appreciate that there is not a divide between "West" and "East" so much as an ongoing historical exchange of human experience.

The ECAI Silk Road Atlas is a series of interactive maps that make this point. Users can see change over time, link from places on the map to associated web resources, and explore maps of different scale, from a single temple to the entire earth. Via the use of technology, the true significance of the Silk Road is vividly rendered to scholars and students alike.

ECAI is an ambitious effort, intending as it does to provide access to all the world's geo-referencable cultural material.

ECAI members link their data using a metadata clearing house at the University of California at Berkeley. Therefore, it is at this point that substantial leverage can be gained from an injection of resources. For example, as one part of this venture, ECAI and the California Digital Library (CDL) are providing publication of internet material for multimedia datasets. This involves two parts:

  1. A fully peer reviewed finished publication for which the CDL promises to preserve the raw data over time.
  2. A series featuring "works in progress" which puts up on the net a collection of projects that are being created by the 800 affiliates. This will be in the form of a report of activities, samples of data being reviewed and examples of interactive use of the data.

At the moment, ECAI does not have space to handle such "works in progress". Not surprisingly, the amount of storage required is sizable, since these works encompass large numbers of scanned images, map spaces, and interactive components. In particular, many researchers contributing to these projects, especially in Africa, Central Asia, and Latin America, are particularly resource-limited. Also, since the work is so highly collaborative, it is hard to justify supporting its individual components, yet scholars worldwide benefit from these efforts, making it a prime beneficiary of low-cost storage services.

Similarly, ECAI would like to establish a mirror site for large data sets to which access is currently hampered because they are poorly connected to the network. An initial version of such a mirror would require hundreds of gigabytes. The demand for cycles for mirroring can be quite significant, as any number of sites registered in the ECAI metadata Clearinghouse are large and expected to grow. For example, mirroring the Arts and Humanities Data Service would require terabytes of storage; the Academia Sinica is discussing a petabyte store for the National Archives.

In summary, with more storage, ECAI can better serve a large international community, can give aid to those affiliates who lack server space at home, and can mirror sites for more rapid access by the US and Internet II communities.