An eScience Data Cache

Large, scalable clusters have become the de facto standard for modeling and simulation in Computational Science and Computer Science. Our Millennium cluster of clusters today has five hundred processors in interconnected clusters ranging from 16 to 300 processors, spread over eight departments. This facility is currently saturated by computational studies ranging from the next-generation Internet to extreme-UV lithography to the identification of neutrino events at the South Pole. Hence we are working with industrial collaborators, Intel and HP, to upgrade the core cluster to the upcoming McKinley IA64 generation to keep it computationally current.

The result of all this simulation and modeling is data: vast amounts of data. Cluster techniques can bring down the cost of scalable storage in much the same way as they do for scalable processing. However, the several terabytes of storage provided in the clusters do not come close to meeting the storage demand, and traditional file and DBMS methods are far from what is needed to support data access and management at this level. Note that many of these simulations are effectively eScience services, which perform the computation on behalf of a larger scientific community and project the data out onto the web. Thus we need to support the export of large data sets as well.

We are building a vast networked storage facility, integrated throughout the clusters, that serves not only as a data store for mining and analysis performed within the cluster but also as a cache for data at various stages of the scientific distillation process. This includes caching data brought in from remote scientific repositories for processing within the cluster and staging results for export via the web. In this context, some of the more advanced aspects of the object store component of this proposal have some bearing: if we introduce client-side software for our interfaces, it should be possible to support smooth integration among primary stores, archival components, and the fast data cache, making the fast cache a true cache of a permanent object store.
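To make the idea of a "true cache of a permanent object store" concrete, the following is a minimal sketch and not a description of the proposed system. It assumes a read-through, write-through policy with least-recently-used eviction, none of which the proposal specifies. The names PermanentStore and FastCache, and the in-memory dictionary standing in for the permanent store, are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class PermanentStore:
    """Hypothetical stand-in for the permanent object store (in-memory dict)."""

    def __init__(self):
        self._objects = {}
        self.reads = 0  # counts how often the slow store is actually read

    def get(self, key):
        self.reads += 1
        return self._objects[key]

    def put(self, key, value):
        self._objects[key] = value


class FastCache:
    """Assumed policy: read-through, write-through, LRU eviction."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first

    def __contains__(self, key):
        return key in self._entries

    def get(self, key):
        if key in self._entries:
            # Hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Miss: fetch from the permanent store, then keep a cached copy.
        value = self._store.get(key)
        self._insert(key, value)
        return value

    def put(self, key, value):
        # Write-through: the permanent store always holds the object,
        # so evicting a cache entry never loses data.
        self._store.put(key, value)
        self._insert(key, value)

    def _insert(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used


store = PermanentStore()
cache = FastCache(store, capacity=2)
for name, payload in [("a", b"1"), ("b", b"2"), ("c", b"3")]:
    cache.put(name, payload)
first = cache.get("a")   # "a" was evicted, so this reads the permanent store
second = cache.get("a")  # now served from the cache
```

Because every write reaches the permanent store first, the cache holds only copies and can discard any entry at will, which is the property that distinguishes a true cache from a second primary store.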

In any case, these stores are too large to be reasonably archived to tape, so we are exploring novel network redundancy techniques both within and beyond the cluster complex.