This project is no longer active; this page is no longer updated.
Related projects: [CHARISMA], [Galley], [Parallel-I/O], [RAPID-Transit], [STARFISH]
Related keywords: [pario], [survey]
We developed the Armada parallel file system to allow a programmer more flexibility in specifying how data could flow from a set of I/O nodes to a set of computation nodes, in the context of large-scale computational grids. In these grids, network latency is significant, and it is important to pipeline the data flow. Armada allows the programmer to specify the data-transformation operators between the computation nodes and the I/O nodes, and internally optimizes the structure before automatically deploying the operators to intermediate nodes.
Develop an I/O framework that allows data-intensive applications to efficiently access geographically distributed data sets.
An exciting trend in high-performance computing is the use of geographically distributed networks of heterogeneous systems and devices known as "computational grids". Of particular interest is the development of data-intensive grid applications. These applications require access to large (terabyte to petabyte) remote data sets, and often require significant preprocessing or filtering before the computation can take place. Such applications exist in many areas of scientific computing, including seismic processing, climate modeling, and particle physics.
The Armada framework for parallel I/O provides a solution for data-intensive applications in which the application programmer and data-set provider deploy a network of application-specific and data-set-specific functionality across the grid. Using the Armada framework, grid applications access remote data by sending data requests through a graph of distributed application objects. A typical graph consists of two distinct portions: one that describes the layout of the data (as defined by the data provider), and one that describes the interface and processing required by the application. Armada appends the two portions to form a single graph, restructures the resulting graph to distribute computation and data flow, and assigns modules to grid processors so that data-reducing modules execute near the data source and data-increasing modules execute near the data destination. The figures below show the original and restructured graphs for an application accessing a replicated and distributed data set.
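The composition and placement idea described above can be illustrated with a small sketch. It is not Armada's implementation: it simplifies the graph to a linear pipeline, and the `Module` class, its `ratio` field (bytes out per byte in), and the `compose` and `place` functions are hypothetical names introduced only for this example. The sketch appends the application portion to the data-provider portion, then picks the cut point that sends the least data over the wide-area link, which leaves data-reducing modules near the source and data-increasing modules near the destination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    ratio: float  # hypothetical: bytes out per byte in (<1 reduces, >1 increases)


def compose(provider_part, app_part):
    """Append the application portion to the data-provider portion."""
    return list(provider_part) + list(app_part)


def place(pipeline, source_bytes):
    """Choose the cut that sends the fewest bytes across the wide-area link.

    Modules before the cut run near the data; the rest run near the client.
    """
    best_cut, best_volume = 0, source_bytes
    volume = source_bytes
    for i, module in enumerate(pipeline, start=1):
        volume *= module.ratio  # data volume leaving this module
        if volume < best_volume:
            best_cut, best_volume = i, volume
    near_data = [m.name for m in pipeline[:best_cut]]
    near_client = [m.name for m in pipeline[best_cut:]]
    return near_data, near_client, best_volume


# Illustrative module names and ratios (assumptions, not taken from Armada).
provider = [Module("read", 1.0), Module("merge", 1.0)]
application = [Module("filter", 0.25), Module("interpolate", 4.0)]
pipeline = compose(provider, application)
near_data, near_client, link_bytes = place(pipeline, source_bytes=1000)
print(near_data, near_client, link_bytes)
```

In this toy pipeline the filter ends up beside the data and the interpolation step beside the client, so only the filtered data crosses the network. A real graph has branches and replicas, which a single cut point cannot express.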
A key feature of Armada is the ability to restructure the application graph. Unlike existing database systems that restructure graphs based on well-known relational operators, Armada's restructuring algorithm uses programmer-defined properties to allow restructuring of a wide range of application classes (not just relational databases).
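As a rough illustration of property-driven restructuring, the sketch below assumes a single hypothetical property, `distributable`, which a programmer would set on an operator that gives the same result whether it is applied after a merge or separately on each stream before the merge. The property name, the `Op` class, and the `restructure` function are assumptions made for this example; the properties Armada actually uses are described in the dissertation, not on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    # Hypothetical programmer-declared property:
    # op(merge(a, b)) is equivalent to merge(op(a), op(b)).
    distributable: bool = False


def restructure(sources, merge_name, downstream):
    """Push leading distributable ops ahead of the merge, one copy per source.

    Only the ops at the front of `downstream` are moved; the first
    non-distributable op blocks everything behind it.
    """
    pushed = []
    remaining = list(downstream)
    while remaining and remaining[0].distributable:
        pushed.append(remaining.pop(0))
    branches = [[src] + [f"{op.name}@{src}" for op in pushed] for src in sources]
    tail = [merge_name] + [op.name for op in remaining]
    return branches, tail


# Illustrative graph: two data sources merged, then three application ops.
branches, tail = restructure(
    sources=["disk0", "disk1"],
    merge_name="merge",
    downstream=[Op("filter", True), Op("sort"), Op("scale", True)],
)
print(branches)
print(tail)
```

Here the filter is replicated onto both branches so that it runs in parallel before the merge, while `scale` stays where it is because `sort` is not declared distributable. Because the decision depends only on the declared property, the same rule applies to operators that have nothing to do with relational algebra.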
Perhaps the most critical part of the Armada system is the placement algorithm. Modules that make up the graph execute on processors near the client, processors near the data, or intermediate network processors. Our approach is to treat placement as a hierarchical graph-partitioning problem. We first partition the graph into domains so as to minimize the data transferred between domains. Next, we assign the modules in each domain to the processors made available by that domain's local resource manager.
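The two-level placement can be sketched on a toy graph. This is only an illustration under stated assumptions: the first level uses brute-force enumeration as a stand-in for a real graph partitioner, the second level uses round-robin as a stand-in for whatever a local resource manager would do, and all module, domain, and processor names are hypothetical.

```python
from itertools import product


def partition_domains(edges, pinned, free, domains):
    """Level 1: pick the domain assignment with the least inter-domain traffic.

    `edges` maps (producer, consumer) pairs to data volume. Brute force is
    used here only because the example graph is tiny.
    """
    best, best_cost = None, float("inf")
    for choice in product(domains, repeat=len(free)):
        assign = {**pinned, **dict(zip(free, choice))}
        cost = sum(w for (u, v), w in edges.items() if assign[u] != assign[v])
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost


def assign_processors(assign, processors):
    """Level 2: spread each domain's modules round-robin over its processors."""
    placement = {}
    counters = {domain: 0 for domain in processors}
    for module in sorted(assign):
        domain = assign[module]
        procs = processors[domain]
        placement[module] = procs[counters[domain] % len(procs)]
        counters[domain] += 1
    return placement


# Illustrative graph: the reader is pinned to the data domain and the
# application to the client domain; the two middle modules are free to move.
edges = {("read", "filter"): 100, ("filter", "expand"): 10, ("expand", "app"): 40}
pinned = {"read": "data", "app": "client"}
assign, cost = partition_domains(edges, pinned, ["filter", "expand"], ["data", "client"])
placement = assign_processors(assign, {"data": ["d0", "d1"], "client": ["c0"]})
print(assign, cost)
print(placement)
```

The first level cuts the graph at its lightest edge, between `filter` and `expand`, and the second level never moves a module out of the domain chosen for it. Exhaustive search grows exponentially with the number of free modules, so a graph of realistic size would need a proper partitioning heuristic.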
Armada is not a fully-featured I/O system. In particular, it lacks support for data management, security, and fault tolerance, which are necessary in a production grid system. While there are active research groups working in each of these areas, there is still much debate over proper solutions. In the Conclusions and Future Work section of the dissertation (oldfield:thesis), we discuss the issues involved in building support for each of these features in the Armada I/O system. In addition, we present potential improvements to the placement algorithm, and describe how to adapt Armada to work efficiently on tightly-connected clusters of workstations.
The views and conclusions contained on this site and in its documents are those of the authors and should not be interpreted as necessarily representing the official position or policies, either expressed or implied, of the sponsor(s). Any mention of specific companies or products does not imply any endorsement by the authors or by the sponsor(s).
[Also available in BibTeX]
Papers are listed in reverse-chronological order.